author Bryan Newbold <bnewbold@archive.org> 2020-06-25 16:33:47 -0700
committer Bryan Newbold <bnewbold@archive.org> 2020-06-25 16:33:47 -0700
commit b150a62569a972b2719da71403b744bafa4f3fb6
tree f90996f3352f21096bcb3fe4d290aa79cbe98c33
parent 8fe71d3395e6d4d0aa2850945dda73bd82d57bed
2020-05_pubmed ingest notes (short)
-rw-r--r-- notes/ingest/2020-05_pubmed.md | 10
1 files changed, 10 insertions, 0 deletions
diff --git a/notes/ingest/2020-05_pubmed.md b/notes/ingest/2020-05_pubmed.md
new file mode 100644
index 0000000..36d00a1
--- /dev/null
+++ b/notes/ingest/2020-05_pubmed.md
@@ -0,0 +1,10 @@
+
+From ARXIV-PUBMEDCENTRAL-CRAWL-2020-04, on fatcat-prod1.
+
+Test small batch:
+
+    zcat ingest_file_pmcid_20200424.json.gz | head -n200 | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Run the whole batch:
+
+    zcat ingest_file_pmcid_20200424.json.gz | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
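
A possible sanity check before producing to Kafka, not part of the commit: the `rg -v "\\\\"` stage drops every line that contains a literal backslash, so it may be worth knowing how many requests that removes. The sketch below is only an illustration. The helper name `count_filtered` is hypothetical, and it swaps in `gzip -dc` and `grep` for `zcat` and `rg` for portability. It only counts lines and sends nothing anywhere.

```shell
# Hypothetical helper (not from the sandcrawler repo): report how many lines
# of a gzipped newline-delimited JSON file survive the backslash filter.
count_filtered() {
    # $1: path to the gzipped request file
    # Total number of lines in the decompressed file.
    total=$(gzip -dc "$1" | wc -l)
    # Lines that do NOT contain a literal backslash; same effect as rg -v "\\\\".
    kept=$(gzip -dc "$1" | grep -v -c '\\')
    # Arithmetic expansion also strips any padding that wc may add.
    echo "total=$((total)) kept=$((kept)) dropped=$((total - kept))"
}
```

Running `count_filtered ingest_file_pmcid_20200424.json.gz` would print a single `total=... kept=... dropped=...` line, which can be compared against expectations before running the whole batch.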