aboutsummaryrefslogtreecommitdiffstats
path: root/notes/ingest/2020-04-07_datacite.md
diff options
context:
space:
mode:
Diffstat (limited to 'notes/ingest/2020-04-07_datacite.md')
-rw-r--r--notes/ingest/2020-04-07_datacite.md18
1 files changed, 0 insertions, 18 deletions
diff --git a/notes/ingest/2020-04-07_datacite.md b/notes/ingest/2020-04-07_datacite.md
deleted file mode 100644
index b0217f0..0000000
--- a/notes/ingest/2020-04-07_datacite.md
+++ /dev/null
@@ -1,18 +0,0 @@
-
-After the broad datacite crawl, want to ingest paper PDFs into fatcat. But many
-of the DOIs are for, eg, datasets, and don't want to waste time on those.
-
-Instead of using full ingest request file from the crawl, will generate a new
-ingest request file using `fatcat_ingest.py` and set that up for bulk crawling.
-
-## Generate Requests
-
- ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
- => Expecting 8905453 release objects in search queries
- => 8.91M 11:49:50 [ 209 /s]
- => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})
-
-## Bulk Ingest
-
- cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
-