## Arxiv
Used the metha-sync tool to update. Then went into the raw storage directory
(rather than using `metha-cat`) and plucked out the weekly files updated since
the last import.
Created a tarball and uploaded to:
https://archive.org/download/arxiv_raw_oai_snapshot_2019-05-22/arxiv_20190522_20191220.tar.gz
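
For reference, a minimal sketch of the "pluck out weekly files since the last
import and tarball them" step, assuming metha's raw storage directory and
date-prefixed harvest filenames like the one imported below (the directory
path and cutoff date are assumptions):

    import tarfile
    from pathlib import Path

    # Assumed metha raw storage location and last-import date; adjust as needed
    METHA_DIR = Path.home() / ".metha"
    LAST_IMPORT = "2019-05-22"
    OUT_TARBALL = "arxiv_20190522_20191220.tar.gz"

    with tarfile.open(OUT_TARBALL, "w:gz") as tar:
        for path in sorted(METHA_DIR.glob("*.xml.gz")):
            # harvest files are assumed to be named starting with an ISO date,
            # e.g. 2019-05-31-00000000.xml.gz
            if path.name[:10] > LAST_IMPORT:
                tar.add(path, arcname=path.name)
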
Downloaded, extracted, then unzipped:

    gunzip *.gz

Run importer:

    export FATCAT_AUTH_WORKER_ARXIV=...
    ./fatcat_import.py --batch-size 100 arxiv /srv/fatcat/datasets/arxiv_20190522_20191220/2019-05-31-00000000.xml
    # Counter({'exists': 1785, 'total': 1001, 'insert': 549, 'skip': 1, 'update': 0})

    fd .xml /srv/fatcat/datasets/arxiv_20190522_20191220/ | parallel -j15 ./fatcat_import.py --batch-size 100 arxiv {}

Things seem to run smoothly in QA. New releases get grouped with the correct
existing works, and there is no obvious duplication.
In prod, loaded just the first file as a start, waiting to see if auto-ingest
happens. Looks like yes! Great that everything is so smooth. All seem to be new
captures.
In production elasticsearch, there were 2,377,645 arxiv releases before this
updated import, 741,033 of them with files attached. Guessing about 150k new
releases, but will check.
Up to 2,531,542 arxiv releases, so only 154k or so new releases created.
781,122 with fulltext.
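
The before/after numbers above were presumably pulled from the release search
index. A rough sketch of reproducing such counts, assuming the public endpoint,
index name, and `arxiv_id`/`file_count` fields (all of which should be checked
against the actual mapping):

    import requests

    # Assumed Elasticsearch count endpoint for the release index
    ES_COUNT = "https://search.fatcat.wiki/fatcat_release/_count"

    def count(query: str) -> int:
        # Elasticsearch _count API with a query_string query
        resp = requests.get(
            ES_COUNT,
            json={"query": {"query_string": {"query": query}}},
        )
        resp.raise_for_status()
        return resp.json()["count"]

    print("arxiv releases:", count("arxiv_id:*"))
    print("arxiv releases with files:", count("arxiv_id:* AND file_count:>0"))
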
## Pubmed QA
Grabbed fresh 2020 baseline, released in December 2019: <https://archive.org/details/pubmed_medline_baseline_2020>

    gunzip *.xml.gz

Run importer:

    export FATCAT_AUTH_WORKER_PUBMED=...
    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1000.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 29975, 'update': 26650, 'skip': 2081, 'insert': 1193, 'warn-pmid-doi-mismatch': 36, 'exists': 36, 'skip-update-conflict': 15, 'inserted.container': 3})

Noticed that `release_year` was not getting set for many releases. Made a small
code tweak (`1bb0a2181d5a30241d80279c5930eb753733f30b`) and tried another file:

    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    # Counter({'total': 30000, 'update': 25912, 'skip': 2119, 'insert': 1935, 'exists': 29, 'warn-pmid-doi-mismatch': 27, 'skip-update-conflict': 5, 'inserted.container': 1})

    real    30m45.044s
    user    16m43.672s
    sys     0m10.792s

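
A guess at the kind of fix the `release_year` tweak involved: fall back to
other date fields in the MEDLINE record when the structured year is missing.
A minimal sketch of such a fallback (not the actual commit; element paths
follow the PubMed baseline DTD):

    import re
    import xml.etree.ElementTree as ET

    def extract_release_year(pubmed_article: ET.Element):
        """Prefer the structured <PubDate><Year>; fall back to the free-text
        <MedlineDate> (e.g. "2018 Jul-Aug", "Summer 2018")."""
        pubdate = pubmed_article.find(".//Journal/JournalIssue/PubDate")
        if pubdate is None:
            return None
        year = pubdate.findtext("Year")
        if year and year.isdigit():
            return int(year)
        medline_date = pubdate.findtext("MedlineDate") or ""
        m = re.search(r"(\d{4})", medline_date)
        return int(m.group(1)) if m else None
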

    time fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

More errors:

    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.3760/cma. j. issn.2095-4352. 2014. 07.014"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.13201/j.issn.10011781.2016.06.002"}
    HTTP response body: {"success":false,"error":"MalformedExternalId","message":"external identifier doesn't match required pattern for a DOI (expected, eg, '10.1234/aksjdfh'): 10.23750/abm.v88i2 -s.6506"}

    10.1037//0002-9432.72.1.50
    BOGUS DOI: 10.1037//0021-843x.106.2.266
    BOGUS DOI: 10.1037//0021-843x.106.2.280

=> these may actually be ok? at least the DOI redirects work

    unparsable medline date, skipping: Summer 2018

TODO:
x fix bad DOI error (real error, skip these; see the sketch after this list)
x remove newline after "unparsable medline date" error
x remove extra line like "existing.ident, existing.ext_ids.pmid, re.ext_ids.pmid))" in warning
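
For the "fix bad DOI error" item, the fix is to filter malformed DOIs out on
the client side (skipping those records) instead of letting the API reject
them. Below is a rough approximation of the check, inferred from the
MalformedExternalId error messages above; this is an illustrative sketch, not
the actual fatcat_tools helper, and the real server-side pattern may be
stricter:

    import re
    from typing import Optional

    # Pattern approximated from the API error message ("expected, eg, '10.1234/aksjdfh'")
    DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")

    def doi_or_none(raw: Optional[str]) -> Optional[str]:
        """Return a normalized DOI, or None if the record should be skipped."""
        if not raw:
            return None
        doi = raw.strip().lower()
        if not DOI_PATTERN.match(doi):
            # e.g. "10.3760/cma. j. issn.2095-4352. 2014. 07.014" (embedded spaces)
            return None
        return doi
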
NOTE: I remember having run through the entire baseline in QA, but didn't save the command or output.
## Pubmed Prod (2020-01-17)
This is after adding a flag to enforce no updates at all, only new releases.
Will likely revisit and run through again with updates that add important
metadata, like exact reference matches for older releases, after doing release
merge/group cleanups.
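
A sketch of what the "no updates" flag amounts to inside the importer;
illustrative only (the flag name, counters, and helper signature are
assumptions, not the actual fatcat importer code):

    # When an entity with the same PMID already exists and updates are
    # disabled, count it as 'exists' and leave it untouched.
    def handle_existing(existing, do_updates: bool, counts: dict) -> str:
        if existing is None:
            counts["insert"] = counts.get("insert", 0) + 1
            return "insert"
        if not do_updates:
            counts["exists"] = counts.get("exists", 0) + 1
            return "exists"
        counts["update"] = counts.get("update", 0) + 1
        return "update"
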

    # git commit: d55d45ad667ccf34332b2ce55e8befbd212922ec
    # had a trivial typo in fatcat_import.py, will push a fix
    export FATCAT_AUTH_WORKER_PUBMED=...
    time ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n1001.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt

Full run:

    fd '.xml$' /srv/fatcat/datasets/pubmed_medline_baseline_2020 | time parallel -j16 ./fatcat_import.py pubmed {} /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    [...]
    Command exited with non-zero status 2
    1271708.20user 23689.44system 31:42:15elapsed 1134%CPU (0avgtext+0avgdata 584588maxresident)k
    486129672inputs+2998072outputs (3672major+139751796minor)pagefaults 0swaps

=> so apparently 2 tasks failed
=> 1271708 seconds of user time is roughly 353 CPU-hours; wall time was the
31:42:15 elapsed above (about 31-32 hours), consistent with the reported
~1134% CPU utilization

Only received a single exception, at Jan 18, 2020 8:33:09 AM UTC:

    /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml
    MalformedExternalId: 10.4149/gpb¬_2017042

Not sure what the other failure was... maybe an invalid filename or argument,
before processing actually started? Or some failure (OOM?) that prevented
Sentry reporting?
Patch normal.py and re-run that single file:

    ./fatcat_import.py pubmed /srv/fatcat/datasets/pubmed_medline_baseline_2020/pubmed20n0936.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
    [...]
    Counter({'total': 30000, 'exists': 27243, 'skip': 1605, 'insert': 1152, 'warn-pmid-doi-mismatch': 26, 'update': 0})

Done!
## Chocula
Command:

    export FATCAT_AUTH_WORKER_JOURNAL_METADATA=[...]
    ./fatcat_import.py chocula /srv/fatcat/datasets/export_fatcat.2019-12-26.json

Result:

    Counter({'total': 144455, 'exists': 139807, 'insert': 2384, 'skip': 2264, 'skip-unknown-new-issnl': 2264, 'exists-by-issnl': 306, 'update': 0})