author | Bryan Newbold <bnewbold@archive.org> | 2020-02-12 19:02:12 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2020-02-12 19:02:12 -0800 |
commit | c3a3fa053fc4a2211618a69b349c77b1a04e6b1f (patch) | |
tree | 6de6119ac7e416763ea8746454b6c70b50e71c37 /notes/ingest | |
parent | c61cb13ae42e3a170c29d4710ea2fc484081ee96 (diff) | |
download | sandcrawler-c3a3fa053fc4a2211618a69b349c77b1a04e6b1f.tar.gz sandcrawler-c3a3fa053fc4a2211618a69b349c77b1a04e6b1f.zip |
jan 2020 bulk ingest notes
Diffstat (limited to 'notes/ingest')
-rw-r--r-- | notes/ingest/20200114_bulk_ingests.md | 26 |
1 files changed, 26 insertions, 0 deletions
diff --git a/notes/ingest/20200114_bulk_ingests.md b/notes/ingest/20200114_bulk_ingests.md
new file mode 100644
index 0000000..9d05cda
--- /dev/null
+++ b/notes/ingest/20200114_bulk_ingests.md
@@ -0,0 +1,26 @@
+
+Generate ingest requests from arabesque:
+
+    zcat /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.json.gz | ./arabesque2ingestrequest.py --link-source arxiv --extid-type arxiv --release-stage submitted - | shuf > /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.ingest_request.json
+
+    zcat /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.json.gz | ./arabesque2ingestrequest.py --link-source pmc --extid-type pmcid - | shuf > /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json
+
+
+Quick tests locally:
+
+    time head -n100 /data/arabesque/ARXIV-CRAWL-2019-10.arabesque.ingest_request.json |./ingest_file.py requests - > sample_arxiv.json
+    time head -n100 /data/arabesque/PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json |./ingest_file.py requests - > sample_pubmed.json
+
+These are all wayback success; looking good! Single threaded, from home laptop
+(over tunnel), took about 9 minutes, or 5.5sec/pdf. That's pretty slow even
+with 30x parallelism. Should re-test on actual server. GROBID pre-check should
+help?
+
+With new bulk topic:
+
+    head PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json -n1000 | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
+Ok, let them rip:
+
+    cat PUBMEDCENTRAL-CRAWL-2019-10.arabesque.ingest_request.json -n1000 | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+    cat ARXIV-CRAWL-2019-10.arabesque.ingest_request.json | kafkacat -P -b localhost -t sandcrawler-prod.ingest-file-requests-bulk -p -1
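The timing remark in the added notes ("about 9 minutes, or 5.5sec/pdf ... pretty slow even with 30x parallelism") can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the 9 minutes covers one 100-request sample and that 30 workers scale perfectly linearly, neither of which the notes confirm. The function names are hypothetical and not part of sandcrawler.

```python
# Back-of-envelope throughput estimate for the local ingest test.
# Assumptions (not confirmed by the notes): the ~9 minute wall time is for
# one 100-request sample, and parallel workers scale perfectly linearly.

SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_item(total_seconds: float, items: int) -> float:
    """Average wall-clock seconds spent per ingest request."""
    return total_seconds / items


def projected_per_day(sec_per_item: float, workers: int) -> float:
    """Requests completed per day under ideal linear scaling."""
    return workers * SECONDS_PER_DAY / sec_per_item


per_pdf = seconds_per_item(9 * 60, 100)   # close to the 5.5 sec/pdf in the notes
daily = projected_per_day(per_pdf, 30)    # upper bound with 30 workers

print(f"{per_pdf:.1f} sec/pdf, at most {daily:,.0f} requests/day with 30 workers")
```

Under these assumptions the ceiling is on the order of half a million requests per day. Real throughput would likely be lower, which is consistent with the note's suggestion to re-test on the actual server.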