Another annual ORCID dump, basically the same as last year (2019). Expecting
around 10 million total ORCIDs, compared to 7.3 million last year, so maybe 2.5
million new creator entities.

In particular motivated to run this import before a potential dblp import
and/or creator creation run.

Files downloaded from:

- <https://orcid.figshare.com/articles/dataset/ORCID_Public_Data_File_2020/13066970>
- <https://archive.org/details/orcid-dump-2020>

## Prep

    wget https://github.com/ORCID/orcid-conversion-lib/raw/master/target/orcid-conversion-lib-0.0.2-full.jar
    java -jar orcid-conversion-lib-0.0.2-full.jar --tarball -i ORCID_2020_10_summaries.tar.gz -v v3_0rc1 -o ORCID_2020_10_summaries_json.tar.gz
    tar xvf ORCID_2020_10_summaries_json.tar.gz
    fd .json ORCID_2020_10_summaries/ | parallel cat {} | jq . -c | pv -l | gzip > ORCID_2020_10_summaries.json.gz
    zcat ORCID_2020_10_summaries.json.gz | shuf -n10000 | gzip > ORCID_2020_10_summaries.sample_10k.json.gz
    ia upload orcid-dump-2020 ORCID_2020_10_summaries_json.tar.gz ORCID_2020_10_summaries.sample_10k.json.gz
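
Before uploading, it can be worth confirming that a generated file really is line-delimited JSON with the expected record count. A minimal Python sketch, assuming one JSON object per line (which is what `jq . -c` emits); `count_json_lines` is a hypothetical helper and not part of fatcat:

```python
import gzip
import json


def count_json_lines(path):
    """Count records in a gzipped JSON-lines file, failing on any bad line."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"line {line_number} is not valid JSON: {err}") from err
            count += 1
    return count
```

For the sample file, `count_json_lines("ORCID_2020_10_summaries.sample_10k.json.gz")` should return 10000, matching `shuf -n10000`.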

## Import

Fetch to prod machine:

    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.json.gz
    wget https://archive.org/download/orcid-dump-2020/ORCID_2020_10_summaries.sample_10k.json.gz

Sample:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.sample_10k.json.gz | ./fatcat_import.py orcid -
    => Counter({'total': 10000, 'exists': 7356, 'insert': 2465, 'skip': 179, 'update': 0})
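
The sample counts give a rough forecast for the full run. A minimal sketch, assuming the 10k sample is representative and taking the "around 10 million" figure from the intro as the full-dump size; `project_counts` is a hypothetical helper:

```python
from collections import Counter

# Result of the 10k sample import, copied from the log above.
sample = Counter({"total": 10000, "exists": 7356, "insert": 2465, "skip": 179, "update": 0})


def project_counts(counts, expected_total):
    """Scale per-outcome counts from a sample up to an expected total."""
    total = counts["total"]
    return {
        key: round(value * expected_total / total)
        for key, value in counts.items()
        if key != "total"
    }


# 10_000_000 is the rough estimate from the intro, not a measured count.
projected = project_counts(sample, 10_000_000)
print(projected)
```

Under those assumptions the projection is about 2.47 million inserts, in line with the "maybe 2.5 million" new creator entities expected.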

Bulk import:

    export FATCAT_AUTH_WORKER_ORCID=[...]
    time zcat /srv/fatcat/datasets/ORCID_2020_10_summaries.json.gz | pv -l | parallel -j8 --round-robin --pipe ./fatcat_import.py orcid -
    => Counter({'total': 1208991, 'exists': 888696, 'insert': 299008, 'skip': 21287, 'update': 0})
    => (8x of the above, roughly)

    real    88m40.960s
    user    389m35.344s
    sys     23m18.396s
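
The per-worker counter and the `time` output are enough to estimate overall totals and throughput. A rough sketch, assuming all 8 workers processed about the same number of records (the log only says "8x of the above, roughly"), so the results are estimates; `parse_duration` is a hypothetical helper:

```python
import re

# One worker's result, copied from the log above; the full run was
# "roughly" 8x this, so the totals below are estimates, not exact counts.
per_worker = {"total": 1208991, "exists": 888696, "insert": 299008, "skip": 21287, "update": 0}
WORKERS = 8  # matches `parallel -j8`


def parse_duration(text):
    """Convert a bash `time` value such as '88m40.960s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


estimated = {key: value * WORKERS for key, value in per_worker.items()}
real = parse_duration("88m40.960s")
cpu = parse_duration("389m35.344s") + parse_duration("23m18.396s")  # user + sys

records_per_second = estimated["total"] / real
busy_cores = cpu / real  # average number of cores kept busy, per `time`

print(estimated)
print(f"{records_per_second:.0f} records/s, {busy_cores:.1f} cores busy on average")
```

With these inputs the estimate is about 9.67 million records processed and about 2.39 million inserts, at roughly 1,800 records per second.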

    Before: Size: 673.36G
    After: Size: 675.55G
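
The size difference can be turned into an approximate per-entity cost. A back-of-the-envelope sketch, assuming "G" means GiB, that all growth comes from this import, and that the insert count is 8x the per-worker figure; none of these is confirmed by the log:

```python
before_g = 673.36
after_g = 675.55
estimated_inserts = 299008 * 8  # per-worker inserts x 8 workers, an estimate

growth_g = after_g - before_g
# Assumption: "G" is GiB (2**30 bytes); with decimal GB the result is about 7% smaller.
bytes_per_insert = growth_g * 2**30 / estimated_inserts

print(f"growth: {growth_g:.2f}G, about {bytes_per_insert:.0f} bytes per inserted entity")
```

That works out to 2.19G of growth, on the order of 1 kB per new creator entity.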