author Bryan Newbold <bnewbold@archive.org> 2021-09-03 18:32:39 -0700
committer Bryan Newbold <bnewbold@archive.org> 2021-09-03 18:32:39 -0700
commit c3cbab57fc5b27a5add399dd27dff0a91c9d9fa1
tree b240977857d9e3f1cf4af1a6bd3cfaa8f1516319
parent cf1bf8001d426c41143436cb578dc64d67d1ca0f
commit old arxiv ingest notes
-rw-r--r-- notes/ingest/2020-11-04_arxiv.md 12
1 files changed, 12 insertions, 0 deletions
diff --git a/notes/ingest/2020-11-04_arxiv.md b/notes/ingest/2020-11-04_arxiv.md
new file mode 100644
index 0000000..f9abe09
--- /dev/null
+++ b/notes/ingest/2020-11-04_arxiv.md
@@ -0,0 +1,12 @@
+
+Ran a bulk dump using the fatcat ingest tool several months ago, and had Martin run
+a crawl.
+
+Crawl is now done, so going to ingest, hoping to get the majority of the
+millions of remaining arxiv.org PDFs.
+
+    zcat /grande/snapshots/fatcat_missing_arxiv_ingest_request.2020-08-21.json.gz | wc -l
+    => 1,288,559
+
+    zcat /grande/snapshots/fatcat_missing_arxiv_ingest_request.2020-08-21.json.gz | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
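The filtering stage of the last command (`rg -v "\\\\" | jq . -c`) drops every line that contains a literal backslash and re-emits the remaining records as compact, one-per-line JSON before they are handed to kafkacat. A minimal Python sketch of that stage follows, for illustration only. The function name `filter_ingest_requests` and the sample field names are hypothetical, not part of the commit or of sandcrawler, and publishing to Kafka is deliberately left out.

```python
import json
from typing import Iterable, Iterator


def filter_ingest_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each request line that contains no backslash.

    Hypothetical helper: mirrors `rg -v "\\\\" | jq . -c` from the notes.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines carry no request; skip them.
            continue
        # Same effect as `rg -v "\\\\"`: drop lines with a literal backslash.
        if "\\" in line:
            continue
        # Same effect as `jq . -c`: parse, then re-serialize on a single line.
        # json.loads raises on malformed input, much as jq would report an error.
        record = json.loads(line)
        yield json.dumps(record, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Sample records with made-up field names and values (assumptions).
    sample = [
        '{"ingest_type": "pdf", "base_url": "https://example.org/a"}',
        '{"ingest_type": "pdf", "base_url": "https://example.org/b\\\\c"}',
    ]
    for out_line in filter_ingest_requests(sample):
        print(out_line)
```

To run this against the real dump, the lines could be read with `gzip.open(path, "rt", encoding="utf-8")` and passed to the function unchanged; the output lines would then be piped to whatever producer is in use.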