author | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@robocracy.org> | 2021-11-29 14:34:02 -0800 |
commit | c32154f2875a7fb9aac727013e1475cdd811e180 (patch) | |
tree | f0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/2020-12-01_orcid.md | |
parent | c5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff) | |
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'extra/bulk_edits/2020-12-01_orcid.md')
-rw-r--r-- | extra/bulk_edits/2020-12-01_orcid.md | 55 |
1 files changed, 55 insertions, 0 deletions
diff --git a/extra/bulk_edits/2020-12-01_orcid.md b/extra/bulk_edits/2020-12-01_orcid.md
new file mode 100644
index 00000000..b6883b17
--- /dev/null
+++ b/extra/bulk_edits/2020-12-01_orcid.md
@@ -0,0 +1,55 @@
+
+Another annual ORCID dump, basically the same as last year (2019). Expecting
+around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
+million new creator entities.
+
+In particular motivated to run this import before a potential dblp import
+and/or creator creation run.
+
+Files download from:
+
+- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
+- <https://archive.org/details/orcid-dump-2020>
+
+## Prep
+
+    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
+
+    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz
+
+    tar xvf ORCID_2020_10_summaries_json.tar.gz
+
+    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz
+
+    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz
+
+    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
+
+## Import
+
+Fetch to prod machine:
+
+    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
+    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz
+
+Sample:
+
+    export FATCAT_AUTH_WORKER_ORCID=[...]
+    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
+
+Bulk import:
+
+    export FATCAT_AUTH_WORKER_ORCID=[...]
+    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
+    => (8x of the above, roughly)
+
+    real 88m40.960s
+    user 389m35.344s
+    sys 23m18.396s
+
+
+    Before: Size: 673.36G
+    After: Size: 675.55G
+
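
The bulk import log records the counter from only one of the eight `parallel` workers, with a note that the full run is roughly 8x that. A minimal Python sketch of that arithmetic follows. It assumes the round-robin split gave each worker a similar share, which the notes imply but do not verify. The helper names `scale_counts` and `insert_rate` are hypothetical and not part of fatcat.

```python
from collections import Counter

# Per-worker summary copied from the bulk import log (one of 8 workers).
PER_WORKER = Counter(
    {"total": 1208991, "exists": 888696, "insert": 299008, "skip": 21287, "update": 0}
)

# Summary of the 10k sample run, also copied from the log.
SAMPLE = Counter(
    {"total": 10000, "exists": 7356, "insert": 2465, "skip": 179, "update": 0}
)


def scale_counts(counts: Counter, workers: int) -> Counter:
    """Estimate whole-run totals by multiplying one worker's counts.

    Assumption: round-robin splitting gave every worker a similar share,
    so this is an estimate, not an exact sum of all worker outputs.
    """
    return Counter({key: value * workers for key, value in counts.items()})


def insert_rate(counts: Counter) -> float:
    """Fraction of processed records that became new creator entities."""
    return counts["insert"] / counts["total"]


if __name__ == "__main__":
    estimated = scale_counts(PER_WORKER, workers=8)
    print("estimated totals:", dict(estimated))
    print(f"sample insert rate: {insert_rate(SAMPLE):.4f}")
    print(f"bulk insert rate:   {insert_rate(PER_WORKER):.4f}")
```

Under that assumption the scaled insert count comes to about 2.4 million new creator entities, close to the "maybe 2.5 million" expected at the top of the notes, and the sample and bulk insert rates agree to within a fraction of a percentage point.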