From 3ca6f8cf8f99739af5a830af0ddc021bb69a7706 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Wed, 15 Apr 2020 12:39:40 -0700
Subject: 2020-04 datacite ingest (in progress)

---
 notes/ingest/2020-04-07_datacite.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
 create mode 100644 notes/ingest/2020-04-07_datacite.md

(limited to 'notes/ingest')

diff --git a/notes/ingest/2020-04-07_datacite.md b/notes/ingest/2020-04-07_datacite.md
new file mode 100644
index 0000000..b0217f0
--- /dev/null
+++ b/notes/ingest/2020-04-07_datacite.md
@@ -0,0 +1,18 @@
+
+After the broad datacite crawl, want to ingest paper PDFs into fatcat. But many
+of the DOIs are for, eg, datasets, and don't want to waste time on those.
+
+Instead of using full ingest request file from the crawl, will generate a new
+ingest request file using `fatcat_ingest.py` and set that up for bulk crawling.
+
+## Generate Requests
+
+    ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
+    => Expecting 8905453 release objects in search queries
+    => 8.91M 11:49:50 [ 209 /s]
+    => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})
+
+## Bulk Ingest
+
+    cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
-- 
cgit v1.2.3
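
Reading aid for the Bulk Ingest command in the notes: before the requests reach `kafkacat`, the pipeline drops every line containing a literal backslash (`rg -v "\\\\"`) and re-emits each remaining line as single-line JSON (`jq . -c`). The Python below is a minimal sketch of those two steps only. It is not part of the patch or of `fatcat_ingest.py`; the function name `filter_requests` and the sample field `base_url` are hypothetical, chosen for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each request line that contains no backslash."""
    for line in lines:
        line = line.strip()
        if not line or "\\" in line:
            # Roughly mirrors `rg -v "\\\\"`: skip lines with a literal backslash.
            # Blank lines are skipped too, since they carry no request.
            continue
        # Roughly mirrors `jq . -c`: parse, then re-serialize on one line.
        yield json.dumps(json.loads(line), separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical sample input; real ingest requests carry more fields.
    sample = [
        '{"base_url": "https://example.org/a.pdf"}',
        '{"base_url": "https://example.org/b\\\\c"}',
    ]
    for request in filter_requests(sample):
        print(request)
```

Filtering on the raw line rather than on parsed values is what the shell version does as well, so a request is dropped whenever any JSON escape sequence appears in it, not only when a URL itself contains a backslash.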