From f21d28315aa632cdb9f84ea8787762d1e27b4310 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Thu, 15 Nov 2018 12:21:45 -0800
Subject: refactoring harvesters

---
 python/README_harvest.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
 create mode 100644 python/README_harvest.md

(limited to 'python/README_harvest.md')

diff --git a/python/README_harvest.md b/python/README_harvest.md
new file mode 100644
index 00000000..e308b90c
--- /dev/null
+++ b/python/README_harvest.md
@@ -0,0 +1,21 @@
+
+## State Refactoring
+
+Harvesters should/will work on fixed window sizes.
+
+Serialize state as JSON, publish to a state topic. On load, iterate through the
+full state topic to construct recent history, and prepare a set of windows that
+need harvesting, then iterate over these.
+
+If running as continuous process, will retain state and don't need to
+re-iterate; if cron/one-off, do need to re-iterate.
+
+To start, do even OAI-PMH as dates.
+
+## "Bootstrapping" with bulk metadata
+
+1. start continuous update harvesting at time A
+2. do a bulk dump starting at time B1 (later than A, with a margin), completing at B2
+3. with database starting from scratch at C (after B2), load full bulk
+   snapshot, then run all updates since A
+
-- 
cgit v1.2.3
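
The "State Refactoring" notes in the patch above describe replaying a state topic to work out which fixed-size windows still need harvesting. A minimal sketch of that idea follows, assuming one-day windows and a JSON record of the form `{"date": ..., "status": "complete"}`. The record layout and the function names `mark_complete`, `load_completed`, and `windows_to_harvest` are hypothetical and not part of the patch, and a plain Python list stands in for the state topic.

```python
import datetime
import json


def mark_complete(day):
    """Serialize a 'window harvested' record as JSON (hypothetical record layout)."""
    return json.dumps({"date": day.isoformat(), "status": "complete"})


def load_completed(state_messages):
    """Replay every message from the state topic and collect the completed days."""
    completed = set()
    for raw in state_messages:
        record = json.loads(raw)
        if record.get("status") == "complete":
            completed.add(datetime.date.fromisoformat(record["date"]))
    return completed


def windows_to_harvest(completed, start, end):
    """Return the one-day windows in [start, end] not yet complete, oldest first."""
    day_count = (end - start).days + 1
    candidates = (start + datetime.timedelta(days=i) for i in range(day_count))
    return [day for day in candidates if day not in completed]


# A list stands in for the state topic; a real harvester would read a message queue.
topic = [
    mark_complete(datetime.date(2018, 11, 12)),
    mark_complete(datetime.date(2018, 11, 14)),
]
todo = windows_to_harvest(
    load_completed(topic),
    datetime.date(2018, 11, 12),
    datetime.date(2018, 11, 15),
)
print(todo)  # the two days missing from the topic: 2018-11-13 and 2018-11-15
```

A continuous process could keep the `completed` set in memory and add to it after each window, while a cron or one-off run would call `load_completed` on every start, matching the distinction drawn in the notes.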