author: Bryan Newbold <bnewbold@archive.org> 2020-03-17 16:36:42 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-03-17 16:36:45 -0700
commit: e1b3edd7af59fe0fd4272a4696387ea09a22a6c0
tree: b31a43dd9da0cff5a59f9684e6258b14decd087e /notes/ingest
parent: eb0ccc24b63d55b437c4a67bc01fb282b2b2c698
unpaywall large ingest notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- notes/ingest/2020-02-14_unpaywall_ingest.md | 10
1 file changed, 10 insertions, 0 deletions
diff --git a/notes/ingest/2020-02-14_unpaywall_ingest.md b/notes/ingest/2020-02-14_unpaywall_ingest.md
index 0bedfdb..24779df 100644
--- a/notes/ingest/2020-02-14_unpaywall_ingest.md
+++ b/notes/ingest/2020-02-14_unpaywall_ingest.md
@@ -474,3 +474,13 @@ Note: will probably end up re-running the below after crawling+ingesting the abo
) TO '/grande/snapshots/unpaywall_fail_cookie_other_20200304.rows.json';
=> 654,885
+## Batch Ingest
+
+Test small batch:
+
+    head -n200 /grande/snapshots/unpaywall_nocapture_20200304.rows.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Full batch:
+
+    cat /grande/snapshots/unpaywall_nocapture_20200304.rows.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
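
Both commands in the patch read the same rows file, so it may be worth confirming the file is present and seeing how many rows each step would publish before anything is sent to Kafka. The following is only a minimal sketch under that assumption: `count_rows` and `preflight_rows` are hypothetical helper names that are not part of the sandcrawler repository, and the sketch touches neither `jq` nor `kafkacat`.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers; these names are illustrative assumptions,
# not part of the sandcrawler repository.

# Print the number of newline-terminated rows in a file.
count_rows() {
    wc -l < "$1" | tr -d ' '
}

# Fail if the rows file is missing or empty. Otherwise report how many rows
# the test batch (first $2 rows, default 200) and the full batch would send.
preflight_rows() {
    rows_file="$1"
    sample="${2:-200}"
    if [ ! -s "$rows_file" ]; then
        echo "preflight: $rows_file is missing or empty" >&2
        return 1
    fi
    total=$(count_rows "$rows_file")
    # The test batch can never contain more rows than the file holds.
    if [ "$total" -lt "$sample" ]; then
        sample="$total"
    fi
    echo "test batch: $sample rows; full batch: $total rows"
}
```

With these functions sourced into the shell, `preflight_rows /grande/snapshots/unpaywall_nocapture_20200304.rows.json` could be run before the test batch; a non-zero exit status means nothing should be piped onward.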