## Arxiv
Used the `metha-sync` tool to update. Then went into the raw storage directory
(as opposed to using `metha-cat`) and plucked out the weekly files updated
since the last import. Created a tarball and uploaded to:
<https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz>
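
The "pluck out files updated since the last import" step can be sketched roughly as below. This is only an illustration, not what was actually run: it assumes the harvested files are named with a leading ISO date (as in `2019-05-31-00000000.xml`) and carry a `.xml.gz` suffix before extraction, and the helper name `files_since` is hypothetical.

```python
from datetime import date
from pathlib import Path
from typing import List


def files_since(storage_dir: Path, cutoff: date) -> List[Path]:
    """Return gzipped harvest files whose filename date is on or after cutoff."""
    selected = []
    for path in sorted(storage_dir.glob("*.xml.gz")):
        # Assumption: the first 10 characters of the name are YYYY-MM-DD.
        try:
            file_date = date.fromisoformat(path.name[:10])
        except ValueError:
            continue  # not a dated harvest file; ignore it
        if file_date >= cutoff:
            selected.append(path)
    return selected
```

Called with the previous snapshot date, e.g. `files_since(storage_dir, date(2019, 5, 22))`, this would yield the list of files to put in the tarball; `storage_dir` is whatever directory the harvester writes to.
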

Downloaded, extracted, then unzipped:

    gunzip *.gz

Run importer:

    export FATCAT_AUTH_WORKER_ARXIV=...
    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Things seem to run smoothly in QA. New releases get grouped with old works
correctly, with no obvious duplication.

In prod, loaded just the first file as a start, waiting to see if auto-ingest
happens. Looks like yes! Great that everything is so smooth. All seem to be new
captures.

In prod elasticsearch, there were 2,377,645 arxiv releases before this updated
import, 741,033 with files attached. Guessing about 150k new releases, but will
check.

Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.
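
As a quick sanity check on the "154k or so" figure, the before/after counts quoted above can be subtracted directly. The variable names are made up for this sketch, and it assumes "with files attached" and "with fulltext" refer to the same measure.

```python
# Counts copied from the notes above.
releases_before = 2_377_645
releases_after = 2_531_542
with_files_before = 741_033
with_fulltext_after = 781_122  # assumed comparable to "with files attached"

new_releases = releases_after - releases_before
new_with_files = with_fulltext_after - with_files_before

print(f"new arxiv releases: {new_releases:,}")              # new arxiv releases: 153,897
print(f"additional releases with files: {new_with_files:,}")  # additional releases with files: 40,089
```
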
## Pubmed

Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>

    gunzip *.xml.gz

Run importer:

    export FATCAT_AUTH_WORKER_PUBMED=...
    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})

Noticed that `release_year` was not getting set for many releases. Made a small
code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried another file:

    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})

    real 30m45.044s
    user 16m43.672s
    sys 0m10.792s

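
One way to sanity-check these counters: in both pubmed runs above, the `update`, `skip`, `insert`, `exists`, and `skip-update-conflict` counts add up exactly to `total`. Treating those five keys as the per-record outcomes is an inference from the numbers, not something confirmed from the importer code, and `outcomes_match_total` is a hypothetical helper.

```python
from collections import Counter

# Assumed set of mutually exclusive per-record outcomes (inferred, not confirmed).
OUTCOME_KEYS = ("update", "skip", "insert", "exists", "skip-update-conflict")


def outcomes_match_total(counts: Counter) -> bool:
    """True if the per-record outcome counts sum to the reported total."""
    return sum(counts[key] for key in OUTCOME_KEYS) == counts["total"]


# Counters copied from the two runs above.
run_1000 = Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193,
                    'warn-pmid-doi-mismatch': 36, 'exists': 36,
                    'skip-update-conflict': 15, 'inserted.container': 3})
run_1001 = Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935,
                    'exists': 29, 'warn-pmid-doi-mismatch': 27,
                    'skip-update-conflict': 5, 'inserted.container': 1})

print(outcomes_match_total(run_1000), outcomes_match_total(run_1001))  # True True
```
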
Then ran over all the baseline files:

    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

More errors:

    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}

    BOGUS DOI: 10.1037//0021-843x.106.2.266
    BOGUS DOI: 10.1037//0021-843x.106.2.280

=> these two are actually OK? At least the redirect works.

    unparsable medline date, skipping: Summer 2018

TODO:
- fix bad DOI error (real error, skip these)
- remove newline after "unparsable medline date" error
- remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
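
For the first TODO item, a client-side check along these lines could skip malformed DOIs before they reach the API. This is a minimal sketch: `clean_doi` and the regex are illustrative assumptions modelled on the `10.1234/aksjdfh` example in the error message, and the pattern the server actually enforces may differ.

```python
import re
from typing import Optional

# Assumed pattern: "10.", a numeric registrant code, a slash, then a suffix
# with no whitespace. Not taken from the API implementation.
DOI_PATTERN = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi(raw: str) -> Optional[str]:
    """Return the stripped DOI if it looks well-formed, else None (caller skips it)."""
    doi = raw.strip()
    if not doi.isascii():
        return None  # non-ASCII characters are treated as malformed here
    if DOI_PATTERN.fullmatch(doi) is None:
        return None  # embedded whitespace or wrong shape
    return doi
```

Under this sketch the two DOIs from the log that contain spaces come back as `None`. It does not explain why `10.13201/j.issn.10011781.2016.06.002` was rejected, so the real server-side pattern should be checked before relying on it.
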