Another annual ORCID dump, basically the same as last year (2019). Expecting around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5 million new creator entities. In particular motivated to run this import before a potential dblp import and/or creator creation run. Files download from: - <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970> - <https://archive.org/details/orcid-dump-2020> ## Prep wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz tar xvf ORCID_2020_10_summaries_json.tar.gz fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz ## Import Fetch to prod machine: wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz Sample: export FATCAT_AUTH_WORKER_ORCID=[...] zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid - => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0}) Bulk import: export FATCAT_AUTH_WORKER_ORCID=[...] time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid - => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0}) => (8x of the above, roughly) real 88m40.960s user 389m35.344s sys 23m18.396s Before: Size: 673.36G After: Size: 675.55G