aboutsummaryrefslogtreecommitdiffstats
path: root/extra
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2022-07-19 12:52:16 -0700
committerBryan Newbold <bnewbold@robocracy.org>2022-07-19 12:52:16 -0700
commit502f3d43184a5bfb47418e3d2d4fa3bfcb022258 (patch)
treec8fc78dc1958f11bc2f0d9d037102ee8433ebfb6 /extra
parentafa963aea3dd5ce82e503401ae9af1af68140206 (diff)
downloadfatcat-502f3d43184a5bfb47418e3d2d4fa3bfcb022258.tar.gz
fatcat-502f3d43184a5bfb47418e3d2d4fa3bfcb022258.zip
2021 ORCID metadata import notes
Diffstat (limited to 'extra')
-rw-r--r--extra/bulk_edits/2022-07-12_orcid.md64
1 files changed, 64 insertions, 0 deletions
diff --git a/extra/bulk_edits/2022-07-12_orcid.md b/extra/bulk_edits/2022-07-12_orcid.md
new file mode 100644
index 00000000..760a16c8
--- /dev/null
+++ b/extra/bulk_edits/2022-07-12_orcid.md
@@ -0,0 +1,64 @@
+
+Annual ORCID import, using 2021 public data file. Didn't do this last year, so
+a catch-up, and will need to do another update later in 2022 (presumably in
+November/December).
+
+Not sure how many records this year. Current count on the orcid.org website is
+over 14 million ORCIDs, in July 2022.
+
+Files download from:
+
+- <https://info.orcid.org/orcids-2021-public-data-file-is-now-available>
+- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2021/16750535>
+- <https://archive.org/details/orcid-dump-2021>
+
+## Prep
+
+ ia upload orcid-dump-2021 -m collection:ia_biblio_metadata ORCID_2021_10_* orcid-logo.png
+
+ wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-3.0.7-full.jar
+
+ java -jar orcid-conversion-lib-3.0.7-full.jar --tarball -i ORCID_2021_10_summaries.tar.gz -v v3_0 -o ORCID_2021_10_summaries_json.tar.gz
+
+ tar xvf ORCID_2021_10_summaries_json.tar.gz
+
+ fd .json ORCID_2021_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2021_10_summaries.json.gz
+ # 12.6M 27:59:25 [ 125 /s]
+
+ zcat ORCID_2021_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2021_10_summaries.sample_10k.json.gz
+
+ ia upload orcid-dump-2021 ORCID_2021_10_summaries.json.gz ORCID_2021_10_summaries.sample_10k.json.gz
+
+## Import
+
+Fetch to prod machine:
+
+ wget https://archive.org/download/orcid-dump-2021/ORCID_2021_10_summaries.json.gz
+ wget https://archive.org/download/orcid-dump-2021/ORCID_2021_10_summaries.sample_10k.json.gz
+
+Sample:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ zcat /srv/fatcat/datasets/ORCID_2021_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+ # in 2020: Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
+ # this time: Counter({'total': 10000, 'exists': 7577, 'insert': 2191, 'skip': 232, 'update': 0})
+
+Bulk import:
+
+ export FATCAT_AUTH_WORKER_ORCID=[...]
+ time zcat /srv/fatcat/datasets/ORCID_2021_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+ 12.6M 1:24:04 [2.51k/s]
+ Counter({'total': 1574111, 'exists': 1185437, 'insert': 347039, 'skip': 41635, 'update': 0})
+ Counter({'total': 1583157, 'exists': 1193341, 'insert': 348187, 'skip': 41629, 'update': 0})
+ Counter({'total': 1584441, 'exists': 1193385, 'insert': 349424, 'skip': 41632, 'update': 0})
+ Counter({'total': 1575971, 'exists': 1187270, 'insert': 347190, 'skip': 41511, 'update': 0})
+ Counter({'total': 1577323, 'exists': 1188892, 'insert': 346759, 'skip': 41672, 'update': 0})
+ Counter({'total': 1586719, 'exists': 1195610, 'insert': 349115, 'skip': 41994, 'update': 0})
+ Counter({'total': 1578484, 'exists': 1189423, 'insert': 347276, 'skip': 41785, 'update': 0})
+ Counter({'total': 1578728, 'exists': 1190316, 'insert': 346445, 'skip': 41967, 'update': 0})
+
+ real 84m5.297s
+ user 436m26.428s
+ sys 41m36.959s
+
+Roughly 2.7 million new ORCIDs, great!