author | Bryan Newbold <bnewbold@archive.org> | 2020-03-17 16:36:42 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-03-17 16:36:45 -0700 |
commit | e1b3edd7af59fe0fd4272a4696387ea09a22a6c0 (patch) | |
tree | b31a43dd9da0cff5a59f9684e6258b14decd087e /notes | |
parent | eb0ccc24b63d55b437c4a67bc01fb282b2b2c698 (diff) | |
unpaywall large ingest notes
Diffstat (limited to 'notes')
-rw-r--r-- | notes/ingest/2020-02-14_unpaywall_ingest.md | 10 |
1 files changed, 10 insertions, 0 deletions
diff --git a/notes/ingest/2020-02-14_unpaywall_ingest.md b/notes/ingest/2020-02-14_unpaywall_ingest.md
index 0bedfdb..24779df 100644
--- a/notes/ingest/2020-02-14_unpaywall_ingest.md
+++ b/notes/ingest/2020-02-14_unpaywall_ingest.md
@@ -474,3 +474,13 @@ Note: will probably end up re-running the below after crawling+ingesting the abo
     ) TO '/grande/snapshots/unpaywall_fail_cookie_other_20200304.rows.json';
     => 654,885
 
+## Batch Ingest
+
+Test small batch:
+
+    head -n200 /grande/snapshots/unpaywall_nocapture_20200304.rows.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Full batch:
+
+    cat /grande/snapshots/unpaywall_nocapture_20200304.rows.json | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
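
The two commands added in this commit differ only in how many rows they read from the snapshot file (`head -n200` versus `cat`). As a minimal sketch of that pattern, the helper below selects either the first N rows or all of them, and a second helper counts the selection so the batch size can be checked before anything is produced to Kafka. The names `select_rows` and `count_rows` are hypothetical and are not part of sandcrawler; the sketch assumes the input is newline-delimited JSON, one request per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of sandcrawler): print the first N rows of a
# newline-delimited JSON file, or every row when N is omitted.
select_rows() {
    rows_file="$1"
    limit="${2:-}"
    if [ -n "$limit" ]; then
        head -n "$limit" "$rows_file"
    else
        cat "$rows_file"
    fi
}

# Hypothetical helper: count the rows that select_rows would emit, so the
# batch size can be compared against the expected count before producing.
count_rows() {
    select_rows "$@" | wc -l | tr -d ' '
}
```

Under these assumptions, the small test batch would correspond to piping `select_rows FILE 200` into the same `jq . -c | kafkacat -P ...` stages shown in the diff, and the full batch to `select_rows FILE` with no limit.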