author Bryan Newbold <bnewbold@archive.org> 2020-02-12 19:02:12 -0800
committer Bryan Newbold <bnewbold@archive.org> 2020-02-12 19:02:12 -0800
commit c3a3fa053fc4a2211618a69b349c77b1a04e6b1f
tree 6de6119ac7e416763ea8746454b6c70b50e71c37 /notes
parent c61cb13ae42e3a170c29d4710ea2fc484081ee96
jan 2020 bulk ingest notes
Diffstat (limited to 'notes')
-rw-r--r-- notes/ingest/20200114_bulk_ingests.md 26
1 files changed, 26 insertions, 0 deletions
diff --git a/notes/ingest/20200114_bulk_ingests.md b/notes/ingest/20200114_bulk_ingests.md
new file mode 100644
index 0000000..9d05cda
--- /dev/null
+++ b/notes/ingest/20200114_bulk_ingests.md
@@ -0,0 +1,26 @@
+
+Generate ingest requests from arabesque:
+
+ zcat /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.json.gz | ./arabesque2ingestrequest.py --link-source arxiv --extid-type arxiv --release-stage submitted - | shuf > /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.ingest_request.json
+
+ zcat /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.json.gz | ./arabesque2ingestrequest.py --link-source pmc --extid-type pmcid - | shuf > /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json
+
+
+Quick tests locally:
+
+ time head -n100 /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.ingest_request.json | ./ingest_file.py requests - > sample_arxiv.json
+ time head -n100 /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json | ./ingest_file.py requests - > sample_pubmed.json
+
+These all come back as wayback successes; looking good! Single-threaded, from a
+home laptop (over a tunnel), this took about 9 minutes, or 5.5 sec/pdf. That is
+pretty slow even with 30x parallelism. Should re-test on the actual server. A
+GROBID pre-check should help?
+
+With new bulk topic:
+
+ head PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json -n1000 | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Ok, let them rip:
+
+ cat PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+ cat ARXIV-CRAWL-2019-10.arabesque.ingest_request.json | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
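
A rough back-of-the-envelope sketch of the timing concern in the notes above: at about 5.5 sec/pdf, how long would a bulk ingest take with 30 workers? The function name `estimate_hours` and the request counts in the loop are hypothetical (the notes do not say how many requests each file holds), and the model assumes perfectly linear scaling across workers, which a real run is unlikely to achieve.

```python
def estimate_hours(num_requests, sec_per_pdf=5.5, workers=30):
    """Estimate wall-clock hours for an ingest run.

    Assumes every request costs sec_per_pdf seconds (the figure measured
    in the local single-threaded test) and that throughput scales
    linearly with the number of parallel workers.
    """
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_seconds = num_requests * sec_per_pdf / workers
    return total_seconds / 3600.0


if __name__ == "__main__":
    # Hypothetical request counts, for illustration only.
    for n in (100_000, 1_000_000):
        print(f"{n} requests: {estimate_hours(n):.1f} hours")
```

Under these assumptions, the single-threaded case reproduces the observed figure: `estimate_hours(100, workers=1)` is about 0.15 hours, or roughly 9 minutes for 100 requests.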