After the broad DataCite crawl, we want to ingest paper PDFs into fatcat. But many of the DOIs are for datasets (etc.), and we don't want to waste time on those. Instead of using the full ingest request file from the crawl, generate a new ingest request file using `fatcat_ingest.py` and set that up for bulk crawling.

## Generate Requests

Query for DataCite releases, restricted to paper-like release types:

    ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
    => Expecting 8905453 release objects in search queries
    => 8.91M 11:49:50 [ 209 /s]
    => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})

## Bulk Ingest

Push the requests to the bulk ingest Kafka topic. The `rg -v "\\\\"` filter skips any lines containing backslashes (escape sequences which could cause trouble downstream), and `jq . -c` validates and re-compacts each JSON line before producing:

    cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1
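A quick spot check of the generated file before pushing (a sanity-check sketch, not part of the original procedure; it only assumes the request file shown above is newline-delimited JSON):

    # line count should be close to the ~8.9M estimate reported above
    wc -l /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json

    # pretty-print a single request object to eyeball the fields
    head -n1 /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | jq .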
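To confirm that messages actually landed on the topic, kafkacat can also consume from the tail (a sketch using kafkacat's standard `-C` consumer mode; broker and topic names are the same ones used above):

    # read the last 5 messages per partition from the bulk topic, then exit
    kafkacat -C -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -o -5 -e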