Recently added a bunch of PDFs to sandcrawler-db. Want to GROBID-extract the
~15 million which haven't been processed yet. Also want to re-GROBID a batch
of PDFs-in-zipfiles from archive.org, and will probably also want to
re-GROBID other petabox files soon.

## pre-1923 zipfile re-extraction

Exact commands (run in parallel):

    fd .zip /srv/sandcrawler/tasks/crossref-pre-1909-scholarly-works/ | \
        parallel -j16 --progress --joblog extract_tasks.log --resume-failed \
        './grobid_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --grobid-host http://localhost:8070 extract-zipfile {}'

    fd .zip /srv/sandcrawler/tasks/crossref-pre-1923-scholarly-works/ | \
        parallel -j16 --progress --joblog extract_tasks_1923.log --resume-failed \
        './grobid_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --grobid-host http://localhost:8070 extract-zipfile {}'

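Before any re-run with `--resume-failed`, it's worth listing which jobs
failed. A quick sketch (not from the original run log) using GNU parallel's
standard joblog columns, where Exitval is field 7 and Command is field 9:

    # print sequence number, exit code, and command for failed jobs
    awk -F'\t' 'NR > 1 && $7 != 0 {print $1, $7, $9}' extract_tasks.log
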
## petabox re-extraction

This was run around 2020-02-03. There are a few million remaining PDFs that
have only partial file metadata (`file_meta`), meaning they were processed
with an old version of the sandcrawler code. Want to get them all covered,
maybe even DELETE the missing ones, so re-GROBIDing petabox-only files.

Of the 2,887,834 petabox files, only 46,232 need re-processing (!). That is
surprisingly few... maybe they would even have been caught by the wayback
backfill?

Dump and de-duplicate the task list. The dump presumably emits one
`sha1hex<TAB>json` line per file, so `uniq -w 40` collapses rows sharing the
same 40-character SHA-1 prefix and `cut -f2` keeps just the JSON:

    psql sandcrawler < dump_regrobid_pdf_petabox.sql
    cat dump_regrobid_pdf_petabox.2020-02-03.json | sort -S 4G | uniq -w 40 | cut -f2 > dump_regrobid_pdf_petabox.2020-02-03.uniq.json

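The notes don't include `dump_regrobid_pdf_petabox.sql` itself. A rough
sketch of the kind of query it could be; the `petabox` and `grobid` table and
column names here are assumptions, not taken from these notes:

    # HYPOTHETICAL: select petabox files with no GROBID result yet; COPY's
    # text format emits "sha1hex<TAB>json", which the dedupe step expects
    psql sandcrawler <<'SQL'
    COPY (
        SELECT petabox.sha1hex, row_to_json(petabox)
        FROM petabox
        LEFT JOIN grobid ON petabox.sha1hex = grobid.sha1hex
        WHERE grobid.sha1hex IS NULL
    ) TO '/srv/sandcrawler/tasks/dump_regrobid_pdf_petabox.2020-02-03.json';
    SQL
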
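Toy demonstration of the dedupe step, using a 4-character key in place of the
40-character SHA-1:

    printf 'aaaa\t{"n": 1}\naaaa\t{"n": 2}\nbbbb\t{"n": 3}\n' \
        | sort | uniq -w 4 | cut -f2
    # output:
    # {"n": 1}
    # {"n": 3}
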
Small start:

    head /srv/sandcrawler/tasks/dump_regrobid_pdf_petabox.2020-02-03.uniq.json | ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -

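To spot-check that those results actually landed in Kafka before kicking off
the full batch, something like the following works; the topic name and the
`.status` field are assumptions, not from these notes:

    # consume the last 5 messages from the (assumed) GROBID output topic
    kafkacat -C -b wbgrp-svc263.us.archive.org:9092 \
        -t sandcrawler-prod.grobid-output-pg -o -5 -e | jq .status
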
Full batch, 25x parallel (`parallel --pipe` splits stdin into blocks of
lines, feeding each worker its own stream of JSON records):

    cat /srv/sandcrawler/tasks/dump_regrobid_pdf_petabox.2020-02-03.uniq.json | pv -l | parallel -j25 --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc263.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -

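`pv -l` shows producer-side lines per second; for consumer-side progress, a
rough check against the database (assuming the `grobid` table has an
`updated` timestamp, which these notes don't confirm):

    # count GROBID rows updated since the batch started (column name assumed)
    psql sandcrawler -c "SELECT count(*) FROM grobid WHERE updated >= '2020-02-03';"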