Diffstat (limited to 'notes/tasks')
-rw-r--r-- | notes/tasks/2021-10-29_crossref_refs_backfill.md | 54 |
1 file changed, 54 insertions, 0 deletions
diff --git a/notes/tasks/2021-10-29_crossref_refs_backfill.md b/notes/tasks/2021-10-29_crossref_refs_backfill.md
new file mode 100644
index 0000000..5a5ffa2
--- /dev/null
+++ b/notes/tasks/2021-10-29_crossref_refs_backfill.md
@@ -0,0 +1,54 @@

The current sandcrawler-db crossref table was backfilled from a 2021-01
snapshot, and has not been updated since.

Would like to use the existing fatcat Kafka feed to keep the crossref table up
to date, and also backfill in GROBID reference parsing of all `unstructured`
references.

Current plan:

1. use the kafkacat CLI to dump the crossref Kafka topic, from the beginning of
   2021 up to some recent date
2. use `persist_tool.py`, with a large batch size (200?), to backfill this dump
   into sandcrawler-db. This will update some rows multiple times (if there
   have been updates)
3. dump the full crossref table, as a point-in-time snapshot
4. filter to crossref records that have `unstructured` references in them (at
   all)
5. use `grobid_tool.py` with `parallel` to batch process references
6. backfill these refs using a simple SQL COPY statement
7. deploy the crossref persist worker, with ref updates on, and roll the
   consumer group back to the date of the dump
8. wait for everything to catch up

Rough, untested command sketches for steps 3 through 7 are at the bottom of
this note.

## Commands

Get a timestamp in milliseconds:

    2021-01-01 is:
    1609488000 in unix time (seconds)
    1609488000000 in milliseconds

Hrm, the oldest messages actually seem to be from 2021-04-28T19:21:10Z, though.
Due to topic compaction? Yup, we have a 180-day compaction policy on that
topic, probably from when Kafka disk space was tight. Oh well!

Updated retention for this topic to `46656000000` ms (~540 days, ~18 months)
using the `kafka-manager` web app.

    kafkacat -C -b wbgrp-svc263.us.archive.org -t fatcat-prod.api-crossref -o s@1609488000000 \
        | pv -l \
        | gzip \
        > crossref_feed_start20210428_end20211029.json.gz

This resulted in ~36 million rows, 46 GB.

`scp` that around, then run persist on `sandcrawler-db`:

    # in pipenv, as sandcrawler user
    zcat /srv/sandcrawler/tasks/crossref_feed_start20210428_end20211029.json.gz \
        | pv -l \
        | ./persist_tool crossref -
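
## Command sketches (untested)

Rough sketches for the remaining steps. File paths, database/table/column
names, consumer group names, and tool flags below are guesses and
placeholders, not verified commands.

As an aside, the millisecond timestamp used above for kafkacat's `-o s@...`
offset can be computed with GNU `date`; the exact value depends on whether
local time or UTC is used:

    # seconds since the epoch for 2021-01-01, in the local timezone
    date -d "2021-01-01" +%s
    # append three zeros (multiply by 1000) for the millisecond value kafkacat expects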
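Step 3, a point-in-time dump of the crossref table. The database name
(`sandcrawler`) and column name (`record`) are assumptions about the schema:

    # dump the full crossref table to a gzipped JSON-lines snapshot
    psql sandcrawler -c "COPY (SELECT record FROM crossref) TO STDOUT" \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/crossref_snapshot.json.gz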
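Step 4, filtering to records with any `unstructured` references. This assumes
one raw crossref work record per line, with a top-level `reference` array:

    # keep only records where at least one reference entry has an `unstructured` field
    zcat /srv/sandcrawler/tasks/crossref_snapshot.json.gz \
        | jq -c 'select(any(.reference[]?; has("unstructured")))' \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/crossref_unstructured_refs.json.gz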
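Step 5, batch-processing references with `grobid_tool.py` and GNU `parallel`.
The subcommand name here is a placeholder (whatever mode runs GROBID citation
parsing over `unstructured` strings), not the actual `grobid_tool.py`
interface:

    # split the filtered dump into chunks and process them in parallel against GROBID
    zcat /srv/sandcrawler/tasks/crossref_unstructured_refs.json.gz \
        | parallel -j16 --block 10M --pipe \
            ./grobid_tool.py parse-crossref-refs - \
        | pv -l \
        | gzip \
        > /srv/sandcrawler/tasks/crossref_refs_grobid.json.gz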
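Step 6, backfilling the parsed refs with SQL COPY. This assumes the GROBID
output has been converted to a TSV matching the target table's columns; the
`grobid_refs` table name, columns, and conflict key are all assumptions about
the sandcrawler-db schema:

    -- load into a staging table, then upsert into the real table
    BEGIN;
    CREATE TEMP TABLE grobid_refs_staging (LIKE grobid_refs INCLUDING DEFAULTS);
    \copy grobid_refs_staging FROM '/srv/sandcrawler/tasks/crossref_refs_grobid.tsv'
    INSERT INTO grobid_refs
        SELECT * FROM grobid_refs_staging
        ON CONFLICT (source, source_id) DO UPDATE
        SET refs_json = EXCLUDED.refs_json, updated = EXCLUDED.updated;
    COMMIT;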
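Step 7, rolling the persist worker's consumer group back to the start of the
dump, using the standard Kafka tooling. The consumer group name and broker
port are placeholders:

    # stop the persist worker first, then reset its consumer group offsets by timestamp
    ./kafka-consumer-groups.sh --bootstrap-server wbgrp-svc263.us.archive.org:9092 \
        --group persist-crossref \
        --topic fatcat-prod.api-crossref \
        --reset-offsets --to-datetime 2021-10-29T00:00:00.000 \
        --execute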