author Bryan Newbold <bnewbold@archive.org> 2020-04-13 13:20:47 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-04-13 13:20:47 -0700
commit dc0841329257f037260b225b66ef80a73fbebea7
tree f7e856b5d557f9ce25815a76300226ab64fa01b7 /notes/ingest
parent 833487810b2e72ed6e22ce68dd1655bad1e87be0
MAG import notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- notes/ingest/2020-03-04_mag.md | 13
1 file changed, 13 insertions, 0 deletions
diff --git a/notes/ingest/2020-03-04_mag.md b/notes/ingest/2020-03-04_mag.md
index a5624c2..97594c8 100644
--- a/notes/ingest/2020-03-04_mag.md
+++ b/notes/ingest/2020-03-04_mag.md
@@ -393,3 +393,16 @@ heritrix):
# in sandcrawler pipenv
./scripts/ingestrequest_row2json.py /grande/snapshots/mag_nocapture_20200313.rows.json > /grande/snapshots/mag_nocapture_20200313.json
+
+## Bulk Ingest of Heritrix Content
+
+Small sample:
+
+ head -n 1000 mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Full run:
+
+ cat mag_nocapture_20200313.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+ 2020-04-07 12:19 (pacific): 11,703,871
+
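For readers of the patch, the filtering stage of the pipelines above (`rg -v "\\\\" | jq . -c`) can be approximated in Python. Inside shell double quotes the pattern reduces to a regex matching a single literal backslash, so `rg -v` drops every line that contains one, and `jq . -c` re-emits each remaining record as compact single-line JSON. This is a minimal sketch, not part of the commit: the function name `filter_requests` and the sample field names are hypothetical, and the Kafka publishing step done by `kafkacat` is left out.

```python
import json


def filter_requests(lines):
    """Yield compact JSON for each input line that contains no backslash.

    Approximates `rg -v "\\\\" | jq . -c`; the name is hypothetical.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and any line containing a literal backslash.
        if not line or "\\" in line:
            continue
        record = json.loads(line)
        # Compact separators and unescaped non-ASCII, similar to `jq -c`.
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample rows; real ingest requests carry more fields.
    sample = [
        '{"base_url": "https://example.com/a.pdf", "ingest_type": "pdf"}',
        '{"base_url": "https://example.com/b\\\\c.pdf"}',
    ]
    for row in filter_requests(sample):
        print(row)
```

Unlike `jq`, which would report a parse error and continue, this sketch raises on malformed JSON, so a real run would need explicit error handling.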