2020-04 datacite ingest (in progress)

author: Bryan Newbold <bnewbold@archive.org> 2020-04-15 12:39:40 -0700
committer: Bryan Newbold <bnewbold@archive.org> 2020-04-15 12:39:40 -0700
commit: 3ca6f8cf8f99739af5a830af0ddc021bb69a7706
tree: cecaed09063ffb1a48fd7cec716e691e57ff956f
parent: b1a68eefd96025ffe81a4e8e05b23dd71b9a6d92
1 files changed, 18 insertions, 0 deletions
diff --git a/notes/ingest/2020-04-07_datacite.md b/notes/ingest/2020-04-07_datacite.md
new file mode 100644
index 0000000..b0217f0
--- /dev/null
+++ b/notes/ingest/2020-04-07_datacite.md
@@ -0,0 +1,18 @@
+
+After the broad datacite crawl, want to ingest paper PDFs into fatcat. But many
+of the DOIs are for, e.g., datasets, and don't want to waste time on those.
+
+Instead of using the full ingest request file from the crawl, will generate a
+new ingest request file using `fatcat_ingest.py` and set that up for bulk crawling.
+
+## Generate Requests
+
+    ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
+    => Expecting 8905453 release objects in search queries
+    => 8.91M 11:49:50 [ 209 /s]
+    => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})
+
+## Bulk Ingest
+
+    cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
+
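For readers following the "Bulk Ingest" step, the filtering stage of that pipeline (`rg -v "\\\\" | jq . -c`) can be approximated in Python. This is only a sketch of what those two commands appear to do together: drop any line containing a literal backslash, then re-emit each remaining JSON object in compact one-line form. The function name `filter_requests` and the sample keys `key` and `note` are hypothetical and are not taken from sandcrawler or fatcat.

```python
import json
from typing import Iterable, Iterator


def filter_requests(lines: Iterable[str]) -> Iterator[str]:
    """Yield compact JSON for each input line that contains no backslash.

    Hypothetical helper: mirrors `rg -v "\\\\" | jq . -c` under the
    assumption that the input is one JSON object per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines rather than passing them to the JSON parser.
            continue
        if "\\" in line:
            # Same effect as `rg -v "\\\\"`: drop lines with any backslash.
            continue
        obj = json.loads(line)  # raises on malformed JSON, as jq would fail
        # Compact separators and unescaped non-ASCII, similar to `jq -c`.
        yield json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


if __name__ == "__main__":
    sample = [
        '{"key": "a", "note": "plain"}',
        '{"key": "b", "note": "has \\\\ backslash"}',
    ]
    for out in filter_requests(sample):
        print(out)
```

Only the first sample line survives the filter. The output of such a filter could then be piped to `kafkacat` exactly as in the command above.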