aboutsummaryrefslogtreecommitdiffstats
path: root/notes/ingest/2020-04-07_datacite.md
blob: b0217f05f8610fd79e780fb826307b504ed2231b (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

After the broad datacite crawl, want to ingest paper PDFs into fatcat. But many
of the DOIs are for, eg, datasets, and don't want to waste time on those.

Instead of using full ingest request file from the crawl, will generate a new
ingest request file using `fatcat_ingest.py` and set that up for bulk crawling.

## Generate Requests

    ./fatcat_ingest.py --allow-non-oa --release-types article-journal,paper-conference,article,report,thesis,book,chapter query "doi_registrar:datacite" | pv -l > /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json
    => Expecting 8905453 release objects in search queries
    => 8.91M 11:49:50 [ 209 /s]
    => Counter({'elasticsearch_release': 8905453, 'ingest_request': 8905453, 'estimate': 8905453})

## Bulk Ingest

    cat /srv/fatcat/snapshots/datacite_papers_20200407.ingest_request.json | rg -v "\\\\" | jq . -c | kafkacat -P -b wbgrp-svc263.us.archive.org -t sandcrawler-prod.ingest-file-requests-bulk -p -1