field | value | date |
---|---|---|
author | Bryan Newbold <bnewbold@robocracy.org> | 2018-11-15 12:21:45 -0800 |
committer | Bryan Newbold <bnewbold@robocracy.org> | 2018-11-15 12:21:45 -0800 |
commit | f21d28315aa632cdb9f84ea8787762d1e27b4310 (patch) | |
tree | 58c6ad0d34260e1d656247ddffa8ee047a8eb520 /python/README_harvest.md | |
parent | 5c47be5b0468c13db868548dccfdf1af50813b0c (diff) | |
refactoring harvesters
Diffstat (limited to 'python/README_harvest.md')
-rw-r--r-- | python/README_harvest.md | 21 |
1 files changed, 21 insertions, 0 deletions
diff --git a/python/README_harvest.md b/python/README_harvest.md
new file mode 100644
index 00000000..e308b90c
--- /dev/null
+++ b/python/README_harvest.md
@@ -0,0 +1,21 @@
+
+## State Refactoring
+
+Harvesters should/will work on fixed window sizes.
+
+Serialize state as JSON, publish to a state topic. On load, iterate through the
+full state topic to construct recent history, and prepare a set of windows that
+need harvesting, then iterate over these.
+
+If running as continuous process, will retain state and don't need to
+re-iterate; if cron/one-off, do need to re-iterate.
+
+To start, do even OAI-PMH as dates.
+
+## "Bootstrapping" with bulk metadata
+
+1. start continuous update harvesting at time A
+2. do a bulk dump starting at time B1 (later than A, with a margin), completing at B2
+3. with database starting from scratch at C (after B2), load full bulk
+   snapshot, then run all updates since A
+
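
The "State Refactoring" notes in this README could be sketched roughly as follows, assuming one-day windows (the README only says "fixed window sizes" and "as dates"). The message schema (`date`, `status`) and the function names `replay_state`, `pending_windows`, and `mark_complete` are hypothetical and are not taken from the fatcat code; the state topic is modeled as a plain list of JSON strings.

```python
import datetime
import json


def replay_state(messages):
    """Fold JSON state messages (oldest first) into a {date: status} dict.

    Assumed (hypothetical) message shape:
        {"date": "2018-11-01", "status": "complete"}
    """
    state = {}
    for raw in messages:
        record = json.loads(raw)
        day = datetime.date.fromisoformat(record["date"])
        state[day] = record["status"]  # later messages override earlier ones
    return state


def pending_windows(state, start, end):
    """Return every one-day window in [start, end] not marked complete."""
    pending = []
    day = start
    while day <= end:
        if state.get(day) != "complete":
            pending.append(day)
        day += datetime.timedelta(days=1)
    return pending


def mark_complete(day):
    """Serialize a completion record, ready to publish to the state topic."""
    return json.dumps({"date": day.isoformat(), "status": "complete"})


# Stand-in for the state topic: two days have already been harvested.
topic = [
    mark_complete(datetime.date(2018, 11, 1)),
    mark_complete(datetime.date(2018, 11, 3)),
]

todo = pending_windows(
    replay_state(topic),
    start=datetime.date(2018, 11, 1),
    end=datetime.date(2018, 11, 4),
)
print([d.isoformat() for d in todo])  # ['2018-11-02', '2018-11-04']
```

Under this sketch, a continuous process would keep the `state` dict in memory and update it after each window, while a cron or one-off run would call `replay_state` over the full topic at every start.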