summaryrefslogtreecommitdiffstats
path: root/extra/bulk_edits/2020-12-01_orcid.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2021-11-29 14:34:02 -0800
committerBryan Newbold <bnewbold@robocracy.org>2021-11-29 14:34:02 -0800
commitc32154f2875a7fb9aac727013e1475cdd811e180 (patch)
treef0e061498a101fa824995fb6ec9f91e7e44257e1 /extra/bulk_edits/2020-12-01_orcid.md
parentc5ea2dba358624f4c14da0a1a988ae14d0edfd59 (diff)
downloadfatcat-c32154f2875a7fb9aac727013e1475cdd811e180.tar.gz
fatcat-c32154f2875a7fb9aac727013e1475cdd811e180.zip
move notes/bulk_edits/ to extra/bulk_edits/
Diffstat (limited to 'extra/bulk_edits/2020-12-01_orcid.md')
-rw-r--r--extra/bulk_edits/2020-12-01_orcid.md55
1 files changed, 55 insertions, 0 deletions
diff --git a/extra/bulk_edits/2020-12-01_orcid.md b/extra/bulk_edits/2020-12-01_orcid.md
new file mode 100644
index 00000000..b6883b17
--- /dev/null
+++ b/extra/bulk_edits/2020-12-01_orcid.md
@@ -0,0 +1,55 @@
+
+Another annual ORCID dump, basically the same as last year (2019). Expecting
+around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
+million new creator entities.
+
+In particular motivated to run this import before a potential dblp import
+and/or creator creation run.
+
+Files download from:
+
+- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
+- <https://archive.org/details/orcid-dump-2020>
+
+## Prep
+
+ wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
+
+ java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz
+
+ tar xvf ORCID_2020_10_summaries_json.tar.gz
+
+ fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz
+
+ zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz
+
+ ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
+
+## Import
+
+Fetch to prod machine:
+
+ wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
+ wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz
+
+Sample:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+ => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
+
+Bulk import:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+ => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
+ => (8x of the above, roughly)
+
+ real 88m40.960s
+ user 389m35.344s
+ sys 23m18.396s
+
+
+ Before: Size: 673.36G
+ After: Size: 675.55G
+