path: root/RUNBOOK.md
author    Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
committer Bryan Newbold <bnewbold@archive.org> 2022-12-23 15:52:02 -0800
commit f3a721a9dce8801b78f7bc31e88dc912b0ec1dba (patch)
tree   fdae9373e78671d0031f83045e6c76de9ad616cf /RUNBOOK.md
parent 8c2c354a74064f2d66644af8f4e44d74bf322e1f (diff)
move a bunch of top-level files/directories to ./extra/
Diffstat (limited to 'RUNBOOK.md')
-rw-r--r--  RUNBOOK.md | 44
1 file changed, 0 insertions, 44 deletions
diff --git a/RUNBOOK.md b/RUNBOOK.md
deleted file mode 100644
index 6c4165d..0000000
--- a/RUNBOOK.md
+++ /dev/null
@@ -1,44 +0,0 @@
-
-## Process Un-GROBID-ed PDFs from Wayback
-
-Sometimes ingest doesn't pick up everything, or we do a heuristic CDX
-import, and we want to run GROBID over all the PDFs that haven't been
-processed yet. We only want one CDX line per `sha1hex`.
-
-A hybrid SQL/UNIX way of generating processing list:
-
- psql sandcrawler < /fast/sandcrawler/sql/dump_ungrobid_pdf.sql | sort -S 4G | uniq -w 40 | cut -f2 > dump_ungrobid_pdf.2020.01-27.json
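The `uniq -w 40` step compares only the first 40 characters of each sorted line, i.e. the `sha1hex`, so exactly one CDX line survives per hash. A toy demo with padded stand-in hashes (not real data):

```shell
# Stand-in "hashes": %040d pads each number to 40 chars, mimicking sha1hex.
# The first two lines share a 40-char prefix, so `uniq -w 40` keeps only one.
printf '%040dA\n%040dB\n%040dC\n' 1 1 2 | sort | uniq -w 40
```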
-
-From here, there are two options: enqueue in Kafka and let workers run, or
-create job files and run them with a local worker and GNU Parallel.
-
-#### Kafka
-
-Copy/transfer to a Kafka node; load a sample and then the whole output:
-
- head -n1000 dump_ungrobid_pdf.2020.01-27.json | kafkacat -P -b localhost -t sandcrawler-prod.ungrobided-pg -p -1
- cat dump_ungrobid_pdf.2020.01-27.json | kafkacat -P -b localhost -t sandcrawler-prod.ungrobided-pg -p -1
-
-#### Local JSON
-
-Older example; if this fails, the entire run must be restarted from scratch:
-
- cat /srv/sandcrawler/tasks/regrobid_cdx.split_*.json | pv -l | parallel -j40 --linebuffer --round-robin --pipe ./grobid_tool.py --kafka-env prod --kafka-hosts wbgrp-svc350.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --kafka-mode --grobid-host http://localhost:8070 -j0 extract-json -
-
-TODO: is it possible to use job log with millions of `--pipe` inputs? That
-would be more efficient in the event of failure.
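One possible answer to the TODO, sketched with a generated stand-in input (the real dump file and chunk-size are assumptions): pre-split the dump into chunk files, so each chunk becomes a single retryable job in the joblog instead of one of millions of `--pipe` blocks:

```shell
# Stand-in input; in practice this would be the real JSON dump.
seq 1000 > dump_sample.json
# One chunk file per 100 lines (a real dump would use something like -l 50000).
split -l 100 -d dump_sample.json regrobid_cdx.split_
ls regrobid_cdx.split_* | wc -l   # 10 chunk files
# Retries then work per chunk file:
#   parallel -j16 --joblog regrobid.log --resume-failed \
#     'cat {} | ./grobid_tool.py ... extract-json -' ::: regrobid_cdx.split_*
```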
-
-## GROBID over many .zip files
-
-We want to use GNU Parallel in a mode that handles retries well:
-
- fd .zip /srv/sandcrawler/tasks/crossref-pre-1909-scholarly-works/ | \
- sort | \
- parallel -j16 --progress --joblog extract_tasks.log --resume-failed \
- './grobid_tool.py --kafka-mode --kafka-env prod --kafka-hosts wbgrp-svc350.us.archive.org:9092,wbgrp-svc284.us.archive.org:9092,wbgrp-svc285.us.archive.org:9092 --grobid-host http://localhost:8070 extract-zipfile {}'
-
-After starting, check that messages are actually getting pushed to Kafka
-(producer failures can be silent!). If anything goes wrong, run the exact same
-command again. The `sort` ensures jobs are enqueued in the same order on a
-re-run; one could also dump the `fd` output to a command file first.
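The command-file variant mentioned above can be sketched like this (directory and file names are made up for illustration):

```shell
# Freeze the job list once, so re-runs enqueue exactly the same jobs even if
# files appear or disappear under the task directory later.
mkdir -p zips && touch zips/a.zip zips/b.zip        # stand-in task directory
find zips -name '*.zip' | sort > zip_tasks.txt      # equivalent of `fd .zip ... | sort`
# Then feed the frozen list to parallel via `::::` (read arguments from file):
#   parallel -j16 --joblog extract_tasks.log --resume-failed \
#     './grobid_tool.py ... extract-zipfile {}' :::: zip_tasks.txt
```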
-