summaryrefslogtreecommitdiffstats
path: root/notes/bulk_edits/2019-12-20_orcid.md
blob: 33dde32f9bc3cf53d1563edaba36d62665bc327f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

Newer ORCID dumps are XML, not JSON. But there is a conversion tool!

    https://github.com/ORCID/orcid-conversion-lib

Commands:

    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
    java -jar orcid-conversion-lib-0.0.2-full.jar OPTIONS

    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2019_summaries.tar.gz -v v3_0rc1 -o ORCID_2019_summaries_json.tar.gz

    # [...]
    # Sat Dec 21 04:43:50 UTC 2019 done 7300000
    # Sat Dec 21 04:44:08 UTC 2019 done 7310000
    # Sat Dec 21 04:44:17 UTC 2019 finished  errors 0

Importing in QA, ran in to some lines like:

    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-0014-6598","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-3750-5654","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0003-1424-4826","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}
    {"response-code":409,"developer-message":"409 Conflict: The ORCID record is locked and cannot be edited. ORCID https://orcid.org/0000-0002-5340-9665","user-message":"The ORCID record is locked.","error-code":9018,"more-info":"https://members.orcid.org/api/resources/troubleshooting"}

Needed to patch to filter those out. Then ran ok like:

    zcat /srv/fatcat/datasets/ORCID_2019_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
    Counter({'total': 10000, 'exists': 5323, 'insert': 4493, 'skip': 184, 'skip-no-person': 160, 'update': 0})

New dump is about 7.3 million rows, so expecting about 3.2 million new
entities, 250k skips.

Doing bulk run like:

    time zcat /srv/fatcat/datasets/ORCID_2019_summaries.json.gz | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -

Prod timing:

    Counter({'total': 910643, 'exists': 476812, 'insert': 416583, 'skip': 17248, 'update': 0})

    real    47m27.658s
    user    245m44.272s
    sys     14m50.836s