summaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits/2019-12-20_orcid.md
diff options
context:
space:
mode:
authorBryan Newbold <bnewbold@robocracy.org>2019-12-27 11:52:31 -0800
committerBryan Newbold <bnewbold@robocracy.org>2019-12-31 12:19:29 -0800
commiteb65f50bd4980e421281a739fa93d7ff71e4cdbb (patch)
treeb089a44ca88bfc9386c9543998c8d7ffee7119e8 /notes/bulk_edits/2019-12-20_orcid.md
parent357142f45d075634b00b49dc983e0d5f394cfc72 (diff)
downloadfatcat-eb65f50bd4980e421281a739fa93d7ff71e4cdbb.tar.gz
fatcat-eb65f50bd4980e421281a739fa93d7ff71e4cdbb.zip
update bulk edit CHANGELOG and orcid notes
Diffstat (limited to 'notes/bulk_edits/2019-12-20_orcid.md')
-rw-r--r--notes/bulk_edits/2019-12-20_orcid.md43
1 files changed, 43 insertions, 0 deletions
diff --git a/notes/bulk_edits/2019-12-20_orcid.md b/notes/bulk_edits/2019-12-20_orcid.md
new file mode 100644
index 00000000..33dde32f
--- /dev/null
+++ b/notes/bulk_edits/2019-12-20_orcid.md
@@ -0,0 +1,43 @@
+
+Newer ORCID dumps are XML, not JSON. But there is a conversion tool!
+
+ https://github.com/ORCID/orcid-conversion-lib
+
+Commands:
+
+ wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
+ java -jar orcid-conversion-lib-0.0.2-full.jar OPTIONS
+
+ java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2019_summaries.tar.gz -v v3_0rc1 -o ORCID_2019_summaries_json.tar.gz
+
+ # [...]
+ # Sat Dec 21 04:43:50 UTC 2019 done 7300000
+ # Sat Dec 21 04:44:08 UTC 2019 done 7310000
+ # Sat Dec 21 04:44:17 UTC 2019 finished errors 0
+
+Importing in QA, ran in to some lines like:
+
+ {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-0014-6598","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+ {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-3750-5654","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+ {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-1424-4826","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+ {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0002-5340-9665","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
+
+Needed to patch to filter those out. Then ran ok like:
+
+ zcat /srv/fatcat/datasets/ORCID_2019_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
+ Counter({'total': 10000, 'exists': 5323, 'insert': 4493, 'skip': 184, 'skip-no-person': 160, 'update': 0})
+
+New dump is about 7.3 million rows, so expecting about 3.2 million new
+entities, 250k skips.
+
+Doing bulk run like:
+
+ time zcat /srv/fatcat/datasets/ORCID_2019_summaries.json.gz | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
+
+Prod timing:
+
+ Counter({'total': 910643, 'exists': 476812, 'insert': 416583, 'skip': 17248, 'update': 0})
+
+ real 47m27.658s
+ user 245m44.272s
+ sys 14m50.836s