Diffstat (limited to 'notes/bulk_edits/2020-12-01_orcid.md')
-rw-r--r-- | notes/bulk_edits/2020-12-01_orcid.md | 55 |
1 files changed, 0 insertions, 55 deletions
diff --git a/notes/bulk_edits/2020-12-01_orcid.md b/notes/bulk_edits/2020-12-01_orcid.md
deleted file mode 100644
index b6883b17..00000000
--- a/notes/bulk_edits/2020-12-01_orcid.md
+++ /dev/null
@@ -1,55 +0,0 @@
-
-Another annual ORCID dump, basically the same as last year (2019). Expecting
-around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
-million new creator entities.
-
-In particular motivated to run this import before a potential dblp import
-and/or creator creation run.
-
-Files download from:
-
-- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
-- <https://archive.org/details/orcid-dump-2020>
-
-## Prep
-
-    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
-
-    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz
-
-    tar xvf ORCID_2020_10_summaries_json.tar.gz
-
-    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz
-
-    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz
-
-    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
-
-## Import
-
-Fetch to prod machine:
-
-    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
-    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz
-
-Sample:
-
-    export FATCAT_AUTH_WORKER_ORCID=[...]
-    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
-
-    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
-
-Bulk import:
-
-    export FATCAT_AUTH_WORKER_ORCID=[...]
-    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
-
-    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
-    => (8x of the above, roughly)
-
-    real    88m40.960s
-    user    389m35.344s
-    sys     23m18.396s
-
-
-    Before: Size: 673.36G
-    After:  Size: 675.55G
-
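
The deleted notes record the counts from only one of the eight parallel workers and describe the overall result as "8x of the above, roughly". A minimal sketch of that arithmetic follows. It assumes all eight workers processed a similar share of the input, which `--round-robin` makes plausible but the notes do not confirm. The variable names are illustrative and are not taken from the fatcat code.

```python
# Counts printed by a single importer process in the bulk run (from the notes).
one_worker = {"total": 1208991, "exists": 888696, "insert": 299008, "skip": 21287, "update": 0}

# Counts from the 10k sample run (from the notes).
sample = {"total": 10000, "exists": 7356, "insert": 2465, "skip": 179, "update": 0}

WORKERS = 8  # assumption: matches `parallel -j8`, with evenly distributed input

# Scale the single-worker counts up to a rough whole-run estimate.
estimate = {key: value * WORKERS for key, value in one_worker.items()}

# Fraction of records that resulted in a new creator entity.
bulk_insert_rate = one_worker["insert"] / one_worker["total"]
sample_insert_rate = sample["insert"] / sample["total"]

print(estimate)
print(f"bulk insert rate:   {bulk_insert_rate:.4f}")
print(f"sample insert rate: {sample_insert_rate:.4f}")
```

Under that assumption the run handled about 9.67 million records and inserted about 2.39 million creators, in line with the "around 10 million" and "maybe 2.5 million" expectations stated at the top of the notes. The sample and bulk insert rates are also close to each other, at roughly 25% each.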