author    | Bryan Newbold <bnewbold@archive.org> | 2020-05-28 14:27:41 -0700
committer | Bryan Newbold <bnewbold@archive.org> | 2020-05-28 14:27:41 -0700
commit    | 46c422e4b6d8e6a36ea65af19afd124ab42e457c (patch)
tree      | f4a2300bc92ccbe5a5c114f442262020787477f7 /blobs
parent    | 5684e2e748e5ddc5962597711af1a63722c4ebde (diff)
blobs: start documenting seaweedfs backfill
Diffstat (limited to 'blobs')
-rw-r--r-- | blobs/tasks.md | 53 |
1 file changed, 53 insertions, 0 deletions
diff --git a/blobs/tasks.md b/blobs/tasks.md
new file mode 100644
index 0000000..34dec8f
--- /dev/null
+++ b/blobs/tasks.md
@@ -0,0 +1,53 @@
+
+## Backfill GROBID XML to Blob Store
+
+Initially ran this when spinning up a new seaweedfs server to replace minio.
+At the time, the grobid persist worker was in db-only mode, because minio was
+too slow to accept uploads. The rough plan is to:
+
+1. run the grobid persist worker from Kafka with a new temporary consumer
+   group, starting from the beginning of the GROBID output topic
+2. when it reaches the end, stop the *regular* consumer group while the
+   temporary one is still running. With the temporary worker still running,
+   the entire topic should be in S3 at that point in time
+3. then reconfigure the regular worker to db+s3 mode. Halt the temporary
+   worker, restart the regular one with the new config, and run it indefinitely
+
+The consumer group isn't a CLI arg, so just edit `persist_worker.py` and set
+it to `persist-grobid-seaweedfs`. Also needed a small patch so that `--s3-only`
+mode doesn't try to connect to postgresql.
+
+Commands:
+
+    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc263.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
+    => Consuming from kafka topic sandcrawler-prod.grobid-output-pg, group persist-grobid-seaweed
+    => run briefly, then kill
+
+On the kafka-broker worker:
+
+    ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --dry-run
+
+Then run 2x instances of the worker (same command as above):
+
+    ./sandcrawler_worker.py --kafka-hosts wbgrp-svc263.us.archive.org:9092 --env prod --s3-bucket sandcrawler --s3-url wbgrp-svc169.us.archive.org:8333 persist-grobid --s3-only
+
+At this point we are CPU-limited on this worker by the python processes (only
+4 cores on this machine).
+
+Check in weed shell:
+
+    weed shell
+
+    > fs.meta.cat buckets/sandcrawler/grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml
+    [...]
+        "isGzipped": false
+    [...]
+        "mime": "application/xml",
+    [...]
+
+An open question is whether we should have separate buckets per derive type,
+eg a GROBID XML bucket separate from the thumbnails bucket, or whether prefix
+directories are enough. Basically this comes down to whether we want things
+mixed together at the volume level. I think we should keep them separate.
+
+Need to set the mimetype in the upload for gzip on XML?
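Note that the `kafka-consumer-groups.sh` invocation documented above is a dry run. To actually move the temporary group's offsets to the start of the topic, the same command is re-run with `--execute` in place of `--dry-run` (the group must have no active members at that moment, which is why the worker is killed first):

    ./kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --to-earliest --group persist-grobid-seaweed --topic sandcrawler-prod.grobid-output-pg --execute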
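The open mimetype/gzip question can also be spot-checked from the S3 side rather than from weed shell. A minimal sketch, assuming boto3 and placeholder credentials for the seaweedfs S3 gateway (the endpoint, bucket, and object key are taken from the commands above):

    # Sketch: check the stored Content-Type / Content-Encoding of one GROBID
    # TEI-XML object via the seaweedfs S3 gateway. Credentials below are
    # placeholders, not the real gateway config.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://wbgrp-svc169.us.archive.org:8333",
        aws_access_key_id="CHANGEME",      # placeholder
        aws_secret_access_key="CHANGEME",  # placeholder
    )

    resp = s3.head_object(
        Bucket="sandcrawler",
        Key="grobid/00/00/000068a76ab125389506e8834483c6ba4c73338a.tei.xml",
    )
    print(resp.get("ContentType"))      # expect "application/xml"
    print(resp.get("ContentEncoding"))  # None unless uploaded gzip-encoded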