author | Bryan Newbold <bnewbold@archive.org> | 2020-06-25 16:33:47 -0700 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-06-25 16:33:47 -0700 |
commit | b150a62569a972b2719da71403b744bafa4f3fb6 (patch) | |
tree | f90996f3352f21096bcb3fe4d290aa79cbe98c33 /notes/ingest/2020-05_pubmed.md | |
parent | 8fe71d3395e6d4d0aa2850945dda73bd82d57bed (diff) | |
2020-05_pubmed ingest notes (short)
Diffstat (limited to 'notes/ingest/2020-05_pubmed.md')
-rw-r--r-- | notes/ingest/2020-05_pubmed.md | 10 |
1 files changed, 10 insertions, 0 deletions
diff --git a/notes/ingest/2020-05_pubmed.md b/notes/ingest/2020-05_pubmed.md
new file mode 100644
index 0000000..36d00a1
--- /dev/null
+++ b/notes/ingest/2020-05_pubmed.md
@@ -0,0 +1,10 @@
+
+From ARXIV-PUBMEDCENTRAL-CRAWL-2020-04, on fatcat-prod1.
+
+Test small batch:
+
+    zcat ingest_file_pmcid_20200424.json.gz | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Run the whole batch:
+
+    zcat ingest_file_pmcid_20200424.json.gz | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
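
For readers unfamiliar with the pipeline in these notes, the filtering stages can be sketched in Python. This is only an illustration, not code from the repository: the function name `filter_requests` and the sample field names are hypothetical. It assumes that `rg -v "\\\\"` drops every line containing a literal backslash and that `jq . -c` re-serializes each remaining line as compact JSON. The final step of producing to Kafka with `kafkacat` is deliberately left out.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each input line that contains no backslash.

    Hypothetical helper: approximates `rg -v "\\\\" | jq . -c` from the notes.
    """
    for line in lines:
        line = line.strip()
        # Approximates `rg -v` with a backslash pattern: skip any line
        # that contains a literal backslash (blank lines are skipped too).
        if not line or "\\" in line:
            continue
        # Approximates `jq . -c`: parse, then re-serialize without extra
        # whitespace, keeping non-ASCII characters unescaped.
        yield json.dumps(
            json.loads(line), separators=(",", ":"), ensure_ascii=False
        )


# Illustrative input only; the field names are assumptions, not taken
# from the actual ingest request file.
sample = [
    '{"ingest_type": "pdf", "base_url": "https://example.org/a"}\n',
    '{"ingest_type": "pdf", "base_url": "https://example.org/b\\\\c"}\n',
    '{"ingest_type": "pdf", "base_url": "https://example.org/d"}\n',
]

# Equivalent in spirit to the small test batch: take only the first
# N input lines before filtering (the notes use `head -n200`).
small_batch = list(filter_requests(islice(sample, 2)))
full_batch = list(filter_requests(sample))
```

To run this against the real file, the lines could be read with `gzip.open(path, "rt", encoding="utf-8")` in place of the in-memory `sample` list.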