author Bryan Newbold <bnewbold@robocracy.org> 2018-11-15 12:21:45 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2018-11-15 12:21:45 -0800
commit f21d28315aa632cdb9f84ea8787762d1e27b4310
tree 58c6ad0d34260e1d656247ddffa8ee047a8eb520
parent 5c47be5b0468c13db868548dccfdf1af50813b0c
refactoring harvesters
Diffstat (limited to 'python/README_harvest.md')
-rw-r--r-- python/README_harvest.md 21
1 files changed, 21 insertions, 0 deletions
diff --git a/python/README_harvest.md b/python/README_harvest.md
new file mode 100644
index 00000000..e308b90c
--- /dev/null
+++ b/python/README_harvest.md
@@ -0,0 +1,21 @@
+
+## State Refactoring
+
+Harvesters should (and will) work on fixed-size windows.
+
+Serialize state as JSON and publish it to a state topic. On load, iterate through
+the full state topic to reconstruct recent history, prepare the set of windows
+that still need harvesting, then iterate over those.
+
+If running as a continuous process, the harvester retains state and does not need
+to re-iterate; if running as a cron job or one-off, it does need to re-iterate.
+
+To start, handle even OAI-PMH harvesting as date windows.
+
+## "Bootstrapping" with bulk metadata
+
+1. start continuous update harvesting at time A
+2. do a bulk dump starting at time B1 (later than A, with a margin), completing at B2
+3. with database starting from scratch at C (after B2), load full bulk
+ snapshot, then run all updates since A
+
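The window-planning step in "State Refactoring" and the ordering constraint in the "Bootstrapping" list can be sketched roughly as below. This is only an illustration, not the harvester code from this commit: the message shape (`{"date": ..., "status": ...}`), the one-day window size, and the function names `missing_windows` and `bootstrap_order_ok` are all assumptions made for the example.

```python
import datetime
import json


def missing_windows(state_messages, start, end):
    """Return the daily windows in [start, end] not yet marked complete.

    `state_messages` is an iterable of JSON strings, one per message read
    from the state topic. The message shape is hypothetical:
        {"date": "2018-11-01", "status": "complete"}
    """
    done = set()
    for raw in state_messages:
        record = json.loads(raw)
        if record.get("status") == "complete":
            done.add(datetime.date.fromisoformat(record["date"]))

    todo = []
    day = start
    while day <= end:
        if day not in done:
            todo.append(day)
        day += datetime.timedelta(days=1)
    return todo


def bootstrap_order_ok(a, b1, b2, c, margin=datetime.timedelta(0)):
    """Check the bootstrapping order: A (plus a margin) before B1, B1 no
    later than B2, and B2 before C."""
    return a + margin <= b1 <= b2 < c
```

A cron or one-off run would call `missing_windows` on every start, since it has to replay the state topic each time; a continuous process could keep the `done` set in memory and skip the replay.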