path: root/notes/bulk_edits/2020-12-01_orcid.md

Another annual ORCID dump, basically the same as last year (2019). Expecting
around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
million new creator entities.

The main motivation is to run this import before a potential dblp import
and/or creator creation run.

Files downloaded from:

- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
- <https://archive.org/details/orcid-dump-2020>

## Prep

    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar

    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz

    tar xvf ORCID_2020_10_summaries_json.tar.gz

    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz

    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz

    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
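
A quick sanity check before importing could be to count lines in the gzipped JSON files, since the importer reads one JSON record per line. This is only a sketch: the helper name `count_gz_lines` is made up for illustration, and the loop simply skips files that are not in the current directory.

```shell
# Hypothetical helper (not part of fatcat): print the number of lines in a
# gzip-compressed file, with any padding from `wc` stripped.
count_gz_lines() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Report line counts for the two files produced above, if present.
for f in ORCID_2020_10_summaries.json.gz ORCID_2020_10_summaries.sample_10k.json.gz; do
    if [ -f "$f" ]; then
        echo "$f: $(count_gz_lines "$f") lines"
    fi
done
```

The sample file would be expected to report 10000 lines, given the `shuf -n10000` step.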

## Import

Fetch to prod machine:

    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz

Sample:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
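
The sample counts can be scaled up for a rough guess at how many inserts the full run will produce. A minimal sketch, assuming the sample is representative and taking the "around 10 million" figure from the intro as the total (an expectation, not a measured count):

```shell
# Counts copied from the sample Counter line above.
sample_total=10000
sample_insert=2465

# Assumption: roughly 10 million records in the full dump (from the intro).
expected_total=10000000

# Scale the sample insert count up to the expected total (integer math).
estimate=$(( sample_insert * expected_total / sample_total ))
echo "estimated inserts: ${estimate}"
```

This prints `estimated inserts: 2465000`, close to the "maybe 2.5 million" guess above.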

Bulk import:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
    => (8x of the above, roughly)

    real    88m40.960s
    user    389m35.344s
    sys     23m18.396s


    Before: Size:  673.36G
    After:  Size:  675.55G
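
Rough totals for the bulk run can be worked out from the numbers above. This is a sketch under the notes' own assumption that all 8 workers reported roughly the same counts as the one Counter line shown, so the results are estimates, not exact tallies.

```shell
# Per-worker counts copied from the bulk Counter line; the factor of 8
# comes from `parallel -j8` and is only approximate.
workers=8
worker_total=1208991
worker_insert=299008

est_total=$(( workers * worker_total ))
est_insert=$(( workers * worker_insert ))

# Wall-clock time from `real 88m40.960s`, rounded up to whole seconds.
wall_seconds=$(( 88 * 60 + 41 ))
rate=$(( est_total / wall_seconds ))

echo "estimated records processed: ${est_total}"
echo "estimated inserts:           ${est_insert}"
echo "records per second (approx): ${rate}"

# Growth between the Before/After size figures; awk handles the decimals.
growth=$(awk 'BEGIN { printf "%.2f", 675.55 - 673.36 }')
echo "size growth: ${growth}G"
```

That works out to about 9,671,928 records processed and 2,392,064 inserts (in line with the intro's guess of around 2.5 million new creator entities), at roughly 1817 records per second, with 2.19G of growth.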