## JALC
Update to eee39965eee92b5005df0d967be779c2f2bb15f8
export FATCAT_AUTH_WORKER_JALC=blah
Extracted the file up front instead of piping it through zcat.
Start small; do a random bunch (10k) single-threaded to pre-create containers:
head -n100 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
shuf -n100 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
shuf -n10000 /srv/fatcat/datasets/JALC-LOD-20180907.rdf | ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
Counter({'total': 9971, 'insert': 7138, 'exists': 2826, 'inserted.container': 144, 'skip': 7, 'update': 0})
Bulk import:
cat /srv/fatcat/datasets/JALC-LOD-20180907.rdf | pv -l | time parallel -j20 --round-robin --pipe ./fatcat_import.py --batch-size 100 jalc - /srv/fatcat/datasets/ISSN-to-ISSN-L.txt --extid-map-file /srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3
Hit an error:
Traceback (most recent call last):
  File "./fatcat_import.py", line 365, in <module>
    main()
  File "./fatcat_import.py", line 362, in main
    args.func(args)
  File "./fatcat_import.py", line 23, in run_jalc
    Bs4XmlLinesPusher(ji, args.xml_file, "<rdf:Description").run()
  File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 605, in run
    self.importer.push_record(soup)
  File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 302, in push_record
    entity = self.parse_record(raw_record)
  File "/srv/fatcat/src/python/fatcat_tools/importers/jalc.py", line 261, in parse_record
    publisher = clean(pubs[0])
IndexError: list index out of range
[...]
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 320733, 'insert': 227567, 'exists': 92651, 'skip': 515, 'inserted.container': 53, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 317741, 'insert': 226336, 'exists': 91232, 'skip': 173, 'inserted.container': 64, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 318022, 'insert': 230063, 'exists': 87852, 'skip': 107, 'inserted.container': 51, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 317404, 'insert': 225893, 'exists': 91363, 'skip': 148, 'inserted.container': 45, 'update': 0})
Command exited with non-zero status 1
70293.61user 1088.65system 4:06:04elapsed 483%CPU (0avgtext+0avgdata 449340maxresident)k
1548632inputs+13813200outputs (248major+3685889minor)pagefaults 0swaps
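The `IndexError` above comes from indexing `pubs[0]` on a record that has no publisher entries. The actual patch is not recorded in these notes; a minimal sketch of the kind of guard that avoids the crash, assuming `pubs` is a plain list of strings, might look like the following. The helper name `first_or_none` is hypothetical, and `str.strip` is only a stand-in for the real `clean()`, whose behavior is not shown here.

```python
from typing import Callable, List, Optional


def first_or_none(
    items: List[str],
    clean: Callable[[str], Optional[str]] = str.strip,
) -> Optional[str]:
    """Return the cleaned first item, or None when there is nothing usable.

    `clean` defaults to str.strip purely as a stand-in (assumption); the
    real clean() from fatcat_tools is not reproduced in these notes.
    """
    if not items:
        # Empty list: this is the case that raised IndexError in the log.
        return None
    # Treat an empty or None cleaned value the same as a missing one.
    return clean(items[0]) or None
```

With a guard like this, a record without a publisher yields `publisher = None` instead of killing the whole worker.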
Re-ran with the same command after patching; success:
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 321098, 'exists': 319095, 'insert': 1726, 'skip': 277, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 317416, 'exists': 315055, 'insert': 1871, 'skip': 490, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 315676, 'exists': 313906, 'insert': 1653, 'skip': 117, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 308695, 'exists': 306407, 'insert': 1856, 'skip': 432, 'update': 0})
Using external ID map: file:/srv/fatcat/datasets/release_ids.ia_munge_20180908.sqlite3?mode=ro
Loading ISSN map file...
Got 2153874 ISSN-L mappings.
Counter({'total': 310210, 'exists': 308280, 'insert': 1782, 'skip': 148, 'update': 0})
71531.84user 1225.33system 1:17:04elapsed 1573%CPU (0avgtext+0avgdata 425368maxresident)k
1195624inputs+14971088outputs (238major+2895079minor)pagefaults 0swaps
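Each `parallel` worker prints its own `Counter` summary, so the overall numbers have to be added up by hand. As a rough sketch (not part of `fatcat_import.py`; the function name `sum_counter_lines` is made up for illustration), the summary lines shown above for the re-run can be totaled like this:

```python
import ast
import re
from collections import Counter
from typing import Iterable

# Matches the dict literal inside a printed "Counter({...})" line.
COUNTER_RE = re.compile(r"Counter\((\{.*\})\)")

# The worker summaries pasted above for the re-run.
RERUN_LINES = [
    "Counter({'total': 321098, 'exists': 319095, 'insert': 1726, 'skip': 277, 'update': 0})",
    "Counter({'total': 317416, 'exists': 315055, 'insert': 1871, 'skip': 490, 'update': 0})",
    "Counter({'total': 315676, 'exists': 313906, 'insert': 1653, 'skip': 117, 'update': 0})",
    "Counter({'total': 308695, 'exists': 306407, 'insert': 1856, 'skip': 432, 'update': 0})",
    "Counter({'total': 310210, 'exists': 308280, 'insert': 1782, 'skip': 148, 'update': 0})",
]


def sum_counter_lines(lines: Iterable[str]) -> Counter:
    """Add up every Counter summary line; any other log line is ignored."""
    total: Counter = Counter()
    for line in lines:
        match = COUNTER_RE.search(line)
        if match:
            # The printed dict holds only string keys and ints, so
            # ast.literal_eval parses it without executing anything.
            total.update(ast.literal_eval(match.group(1)))
    return total


if __name__ == "__main__":
    print(sum_counter_lines(RERUN_LINES))
```

For the five summaries pasted above this comes to 8888 inserts and 1562743 already-existing releases out of 1573095 records. The pasted output may not include every worker, so treat these as partial totals.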
## Journal Metadata Update
Updating with fixed KBART year_spans, for better coverage detection.
export FATCAT_AUTH_WORKER_JOURNAL_METADATA=...
./fatcat_import.py journal-metadata /srv/fatcat/datasets/journal_metadata.2019-02-20.fixed.json
Counter({'total': 107793, 'exists': 95921, 'update': 11549, 'insert': 270, 'skip': 53})
## PubMed
export FATCAT_AUTH_WORKER_PUBMED=...
Start small (and cut it off early) to make sure the basics work:
./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2019/pubmed19n0400.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
Kick off the big one:
fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2019 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
Seemed to hang; a couple of workers were still around, sleeping (state `S`):
fatcat 1649 0.1 0.1 2335588 56076 pts/2 S Jun01 5:05 python3 ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2019/pubmed19n0966.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
fatcat 9460 0.2 0.1 2333520 54004 pts/2 S May31 12:21 python3 ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2019/pubmed19n0383.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
fatcat_client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Content-Length': '183', 'X-Clacks-Overhead': 'GNU aaronsw, jpb', 'X-Span-ID': '563f6833-be1e-452e-bcd6-e7c721edf9eb', 'Content-Type': 'application/json', 'Date': 'Sat, 01 Jun 2019 12:31:11 GMT'})
HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a PubMed Central ID (PMCID) (expected, eg, 'PMC12345'): wst_2018_414"}
And another:
fatcat_client.rest.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Sat, 01 Jun 2019 12:37:01 GMT', 'Content-Type': 'application/json', 'Content-Length': '182', 'X-Span-ID': 'c8cbcffb-d3c5-4ceb-b157-d628dbac613f', 'X-Clacks-Overhead': 'GNU aaronsw, jpb'})
HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a PubMed Central ID (PMCID) (expected, eg, 'PMC12345'): wh_2018_033"}
And another (jeez!):
HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a PubMed Central ID (PMCID) (expected, eg, 'PMC12345'): wst_2018_399"}
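All three failures are values like `wst_2018_414` ending up in the PMCID field and being rejected by the API. One way to avoid the 400s is to drop anything that does not look like a PMCID before building the release. The sketch below is only an illustration: the regex is an assumption based on the `PMC12345` example in the error message (the server's actual pattern is not shown here), and `clean_pmcid` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed shape, taken from the 'PMC12345' example in the API error;
# the server-side validation pattern itself is not shown in these notes.
PMCID_RE = re.compile(r"PMC\d+")


def clean_pmcid(raw: Optional[str]) -> Optional[str]:
    """Return the stripped value if it looks like a PMCID, else None."""
    if not raw:
        return None
    candidate = raw.strip()
    # fullmatch so that trailing junk (e.g. 'PMC123abc') is rejected too.
    return candidate if PMCID_RE.fullmatch(candidate) else None
```

Returning `None` means the release is still imported, just without a PMCID, rather than the whole batch failing.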
And another derp:
Traceback (most recent call last):
  File "./fatcat_import.py", line 365, in <module>
    main()
  File "./fatcat_import.py", line 362, in main
    args.func(args)
  File "./fatcat_import.py", line 43, in run_pubmed
    Bs4XmlLargeFilePusher(pi, args.xml_file, "PubmedArticle", record_list_tag="PubmedArticleSet").run()
  File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 666, in run
    self.importer.push_record(record)
  File "/srv/fatcat/src/python/fatcat_tools/importers/common.py", line 302, in push_record
    entity = self.parse_record(raw_record)
  File "/srv/fatcat/src/python/fatcat_tools/importers/pubmed.py", line 494, in parse_record
    int(pub_date.Day.string))
ValueError: day is out of range for month
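`ValueError: day is out of range for month` is what `datetime.date` raises for impossible dates such as February 30, which evidently do occur in the source XML. A minimal sketch of a tolerant constructor, assuming the importer can live without a full date in that case (`safe_date` is a hypothetical helper, not the actual patch):

```python
import datetime
from typing import Optional


def safe_date(year: int, month: int, day: int) -> Optional[datetime.date]:
    """Build a date, returning None instead of raising on impossible dates."""
    try:
        return datetime.date(year, month, day)
    except ValueError:
        # e.g. day 30 in February, or month 13
        return None
```

The caller can then fall back to a year-only value when this returns `None`.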
Lesson here is to really get the whole thing working end-to-end in QA, with no
errors under `parallel`, before trying in prod. Was impatient!
TODO: re-run these with a patch. Going to do that after dump/snapshot/etc., though.