path: root/notes/missing_2020-03-20.md

Hand investigation of CORD-19 records that don't match fatcat metadata.

## Overview

    811 such records from 2020-03-20 CORD-19 dataset.
    221 have some DOI
    242 have a PMCID
    12 have a PMID

    jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | wc -l
    => 364

    jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .source_x -r | sort | uniq -c | sort -nr
    288 WHO
     76 Elsevier

    jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .title | sort | uniq -c | head
    => 62 have empty titles, all from Elsevier

    jq .title | sort | uniq -c | head
    => from full batch, 224 have empty titles; no duplicates
      214 Elsevier
        9 PMC
        1 PMC_new
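
The same tallies can be reproduced without jq. A minimal Python sketch, assuming the input is one JSON record per line with the `doi`, `pubmed_id`, `pmcid`, `source_x`, and `title` fields used above; the function names and the sample records are made up for illustration:

```python
import json
from collections import Counter

ID_FIELDS = ("doi", "pubmed_id", "pmcid")


def lacks_identifiers(record):
    """True when DOI, PMID, and PMCID are all empty strings.

    Mirrors the jq filter: a field that is absent does not count as empty.
    """
    return all(record.get(key) == "" for key in ID_FIELDS)


def summarize(lines):
    """Count identifier-less records per source, and how many have no title."""
    by_source = Counter()
    empty_titles = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if not lacks_identifiers(record):
            continue
        by_source[record.get("source_x", "")] += 1
        if record.get("title", "") == "":
            empty_titles += 1
    return by_source, empty_titles


# Made-up records, only to show the input shape
SAMPLE = [
    '{"doi": "", "pubmed_id": "", "pmcid": "", "source_x": "WHO", "title": "Some title"}',
    '{"doi": "", "pubmed_id": "", "pmcid": "", "source_x": "Elsevier", "title": ""}',
    '{"doi": "10.1234/example", "pubmed_id": "", "pmcid": "", "source_x": "Elsevier", "title": "Index"}',
]

by_source, empty_titles = summarize(SAMPLE)
print(by_source.most_common(), empty_titles)
```

`Counter.most_common()` plays the role of `sort | uniq -c | sort -nr`.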

## Lacking Identifiers, Has Title

    jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
    => 302 such papers

Random sample 10:

    "Surgical management strategies for orthopedic trauma patients under epidemic of novel coronavirus pneumonia"
    Chinese Journal of Trauma, 2020
    => Volume: 36, Issue: 2, pp 124-128
    => https://academic.microsoft.com/paper/3009524916/related

    "Molecular biology of microbial pathogenicity: Adhesion, invasion and receptors"
    FEMS Microbiology Letters, 1984
    => elsevier
    => missing DOI
    => free PDF, hard to crawl (OUP)
    => https://fatcat.wiki/release/wbzyyyuhcfhnjfmuouygqyi7vy

    "Recommendations for general surgery clinical practice in novel coronavirus pneumonia situation"
    Chinese Journal of Surgery, 2020
    => missing DOI and PMID
    => DOI not in crossref
    => free PDF download
    => https://fatcat.wiki/release/3przclac6jfarpru2ol4xafmie

    "Author index volume 20 (1983)"
    FEMS Microbiology Letters, 1983
    => stub

    "Paying close attention to diabetic patients with novel coronavirus infection"
    Chinese Journal of Endocrinology and Metabolism, 2020
    => https://academic.microsoft.com/paper/3010104673/related
    => no fatcat

    "The recommendation for management of the bereavements among the family members died with novel coronavirus pneumonia"
    Chinese Journal of Behavioral Medicine and Brain Science, 2020
    => https://academic.microsoft.com/paper/3009517691/related
    => 1674-6554
    => 10.3760/cma.j.issn.1674-6554.2020.... (?)
    => there are English and Chinese websites for the journal, but the Chinese one is more up-to-date
    => http://med.wanfangdata.com.cn/Periodical/zgxwyxkx
    => publisher resources:
        http://subject.med.wanfangdata.com.cn/Channel/7
        https://translate.googleusercontent.com/translate_c?depth=1&rurl=translate.google.com&sl=auto&source=osdd&sp=nmt4&tl=en&u=http://subject.med.wanfangdata.com.cn/Channel/7&usg=ALkJrhj0J6IQM6RKNFYPKpMshQS9UcS2oQ

    "Singapore claims first use of antibody test to track coronavirus infections | Science | AAAS"
    Science Magazine, 2020
    => https://fatcat.wiki/release/ovghasgr6bclbj2ksbnaz635ci
    => simply missing DOI

    "High resolution CT findings and clinical features of novel coronavirus pneumonia in Guangzhou"
    Chinese Journal of Radiology, 2020
    => https://academic.microsoft.com/paper/3010637392/related
    => http://www.cjrjournal.org/
    => https://fatcat.wiki/container/grktk23p5rayfncvh3bm6ylbwy (old only)
    => not seeing this paper in any recent (2020) issues

    "COVID-19 Update From China"
    JAMA, 2020
    => seems to be an audio recording and/or video on "JN Learning"
    => https://edhub.ama-assn.org/jn-learning/audio-player/18234306
    => https://edhub.ama-assn.org/jn-learning/video-player/18234510
    => abstract matches

    "Coronavirus latest: death toll passes 2,000"
    Nature, 2020
    => this is a commentary article which is getting continuously updated, with the title changing
    => DOI: 10.1038/d41586-020-00154-w
    => do have a wayback snapshot with exact title:
    => http://web.archive.org/web/20200220185055/https://www.nature.com/articles/d41586-020-00154-w

Summary:
- some informal material mixed in (commentary/opinion/video)
- a number of Chinese papers, often with a DOI, are not indexed in western
  databases but are in MAG. MAG identifiers are mostly not included. Could we
  get metadata from Wanfang Data (company)?
- some stub articles: eg, indexes
- some exact title matches in fatcat, where the record is only missing an
  identifier (pubmed_id, DOI)

Top journals:

     46 (none)
     15 Chinese Journal of Laboratory Medicine
        => fatcat container, no papers
     14 FEMS Microbiology Letters
        => fatcat container, OUP, many papers
     11 Chinese Journal of Medical Science Research Management
        => no fatcat
        => 1006-1924
        => https://portal.issn.org/resource/ISSN/1006-1924
        => "Zhonghua yixue keyan guanli zazhi"
     11 Chinese Journal of Hospital Administration
     10 Chinese Journal of Radiology
     10 Chinese Journal of Emergency Medicine
      9 Chinese Journal of Preventive Medicine
      9 Chinese Journal of Experimental Ophthalmology
      9 Chinese Critical Care Medicine
      8 Chongqing Medicine
      8 Chinese Journal of Infectious Diseases
        => in fatcat (twice?)
        => "Zhonghua chuan ran bing za zhi"
      7 Chinese Journal of Trauma
        => no fatcat
        => in MAG
    [...]

## DOIs

    jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
    => 194 such papers

    jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c |  jq .source_x -r | sort | uniq -c
    176 Elsevier
      5 medrxiv
     13 WHO

    jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c |  jq .title -r | sort | uniq -c | sort -nr | head
     22 Index
      7 Subject Index
      2 S
      2 C
      1 V

Trying a sample of DOIs that didn't match:

    jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c |  shuf -n10 | jq .doi -r

    10.1101/2020.01.31.20019935
    => medrxiv, not sure why not in fatcat
    10.1016/j.diagmicrobio.2004.10.002
    => elsevier, not in fatcat

    10.1016/B978-0-323-53045-3.00035-0
    => no such doi
    10.1016/B978-0-323-37591-7.00038-0
    => no such doi
    10.1016/B978-0-323-52993-8.00048-5
    => no such doi
    10.1016/B978-0-323-44887-1.00022-5
    => no such doi
    10.1016/B978-343721804-0.50036-3
    => elsevier, index
    10.1016/B978-0-323-44887-1.00044-4
    => no such doi
    => https://www.us.elsevierhealth.com/kendigs-disorders-of-the-respiratory-tract-in-children-9780323448871.html
    10.1016/B978-0-323-04579-7.00035-6
    => no such doi
    10.1016/B978-0-323-55512-8.00140-X
    => no such doi
    => 10.1016/B978-0-323-55512-8.00140-X
    => DOI is listed on elsevier homepage
    => https://www.sciencedirect.com/science/article/pii/B978032355512800140X

Many of these "no such DOI" cases may be DOIs that are only partially registered?
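
Most of the unresolved DOIs above are Elsevier book chapters, and the DOI suffix appears to embed the book's ISBN-13 (compare the `9780323448871` in the Kendig's URL). The one ScienceDirect link above also suggests the PII is just the suffix with hyphens and dots removed. A rough sketch for grouping such DOIs by book; the function name is hypothetical, and the PII rule is inferred from that single example, not from any publisher documentation:

```python
import re

SCIENCEDIRECT_PII_URL = "https://www.sciencedirect.com/science/article/pii/"


def elsevier_chapter_parts(doi):
    """Return (isbn13, pii) for an Elsevier book-chapter DOI, else None.

    Expects DOIs shaped like 10.1016/B<hyphenated ISBN-13>.<chapter id>.
    """
    prefix, _, suffix = doi.partition("/")
    if prefix != "10.1016" or not suffix.startswith("B"):
        return None
    book, _, chapter = suffix[1:].partition(".")
    isbn = book.replace("-", "")
    if not re.fullmatch(r"97[89]\d{10}", isbn):
        return None
    # Assumption: PII = suffix with "-" and "." stripped (seen in one example only)
    pii = "B" + isbn + chapter.replace("-", "")
    return isbn, pii


for doi in [
    "10.1016/B978-0-323-55512-8.00140-X",
    "10.1016/B978-0-323-44887-1.00044-4",
    "10.1016/j.diagmicrobio.2004.10.002",
]:
    parts = elsevier_chapter_parts(doi)
    if parts is None:
        print(doi, "=> not a book-chapter DOI")
    else:
        isbn, pii = parts
        print(doi, "=>", isbn, SCIENCEDIRECT_PII_URL + pii)
```

Grouping by ISBN would show whether the failures cluster in a handful of books, which would fit the "Top journals (actually books?)" list.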

Top journals (actually books?) with missing DOIs:

    bnewbold@orithena$ cat missing.json | jq .who_paper -c | jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c |  jq .journal | sort | uniq -c | sort -nr | head
         28 "Mandell, Douglas, and Bennett's Principles and Practice of Infectious Diseases"
         13 "Principles and Practice of Pediatric Infectious Diseases"
         11 "Infectious Diseases"
          9 "Hunter's Tropical Medicine and Emerging Infectious Diseases"
          7 "Kendig's Disorders of the Respiratory Tract in Children"
          5 ""
          4 "Vaccine"
          4 "The Dictionary of Cell & Molecular Biology"
          4 "Clinical Immunology"
          3 "Zakim and Boyer's Hepatology"
          [...]

## PMID/PMCID

    jq 'select(.doi == "" and .pubmed_id !="")' -c
    => 8

    jq 'select(.doi == "" and .pmcid !="")' -c
    => 223

All of these PMIDs and PMCIDs seem to be valid. From a quick scan, they seem
not to match in fatcat because the works already exist there with DOIs.

## Recommendations

In your data munging, filter out:

- works with blank titles and no external identifier (eg, no DOI, PMCID, PMID, MAG ID)
- works with blocklist titles (see below). I assume these got included due to
  fulltext search matches, but I think are just noise
- works with titles which are a single capital letter (eg, "S", "C")

Title blocklist (preceded by count); usually I do these by lower-casing and
stripping non-alphanumeric characters before comparing:

    348 Index
     83 Subject Index
     76 Subject index
     69 Author index
     68 Contents
     67 Articles of Significant Interest Selected from This Issue by the Editors
     66 Information for Authors
     36 Graphical contents list
     29 Table of Contents
     21 In brief
     20 Preface
     20 Editorial Board
     19 Author Index
     18 Volume Contents
     18 Research brief
     18 Abstracts
     15 Opportunities from the Center for Perioperative Education
     13 Keyword index
     12 PNAS Plus Significance Statements
     11 In This Issue
     10 Current Awareness on Comparative and Functional Genomics
      9 Introduction
      9 Highlights of this issue
      9 Contents list
      9 Abbreviations
      8 QUIZ CORNER
      8 Positions available
      8 Journal Watch
      8 Index of Authors
      8 Editorial
      8 Cumulative Index
      7 Table of contents
      7 Quiz Corner
      7 Index of Subjects
      7 INDEX
      7 Foreword
      7 Bibliography of the current world literature
      6 Viral gastroenteritis
      6 Public Health Watch
      6 Contributors
      6 Contents of Volume
      6 Contents of other veterinary journals from Elsevier
        Highlights of this issue
        QUIZ CORNER
        Answers to Quiz Corner
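
The three filter rules and the title normalization could look roughly like this in Python. This is a sketch, not the code fatcat uses: the function names are made up, `mag_id` is a hypothetical field name for the MAG identifier, and only a few blocklist entries are copied in:

```python
import re

# A few entries from the blocklist above; extend with the rest as needed
TITLE_BLOCKLIST = [
    "Index",
    "Subject Index",
    "Author index",
    "Contents",
    "Information for Authors",
    "Table of Contents",
    "Quiz Corner",
    "Answers to Quiz Corner",
]

# "mag_id" is an assumed field name; the other three appear in the jq filters
ID_FIELDS = ("doi", "pmcid", "pubmed_id", "mag_id")


def normalize_title(title):
    """Lower-case, then drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


BLOCKED = {normalize_title(title) for title in TITLE_BLOCKLIST}


def should_skip(record):
    """Apply the three recommendations to one metadata record (a dict)."""
    title = (record.get("title") or "").strip()
    has_identifier = any(record.get(field) for field in ID_FIELDS)
    if title == "" and not has_identifier:
        return True  # blank title and no external identifier
    if re.fullmatch(r"[A-Z]", title):
        return True  # single capital letter, eg "S" or "C"
    return normalize_title(title) in BLOCKED


print(should_skip({"title": "SUBJECT INDEX", "doi": "10.1234/example"}))
print(should_skip({"title": "COVID-19 Update From China"}))
```

Because of the normalization, "Subject Index", "Subject index", and "INDEX"-style case variants collapse to one blocklist entry each.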

## Notes

Chinese journals also carry a separate registry number (the "CN" number), eg "ISSN 1005-1201 CN 11-2149/R"

Interesting sites to crawl or translate:
    http://medjournals.cn/index.do
        => 1 million papers
    http://rs.yiigle.com/yufabiao/1181337.htm
        => paper repository/host?
    http://subject.med.wanfangdata.com.cn/Channel/7?mark=34
        => list of papers? wanfang data seems like a large publisher
    http://www.wanfangdata.com/about/about.asp
        => mainland china (beijing)
        => commercial/national holder of 40+ million papers
        => indexed by EBSCO

## Fetching Metadata

    cat metadata/cord19.2020-03-27.missing.json | jq 'select(.doi != "") | .doi' -r | sort -u > missing_doi.tsv
    cat metadata/cord19.2020-03-27.missing.json | jq 'select(.pubmed_id != "") | .pubmed_id' -r | sort -u > missing_pmid.tsv
    cat metadata/cord19.2020-03-27.missing.json | jq 'select(.pmcid != "") | .pmcid' -r | sort -u > missing_pmcid.tsv

    cat missing_doi.tsv | parallel -j4 'http --headers head "https://doi.org/{}" | head -n1 | awk "{print \"{}\t\" \$2}"' > missing_doi_status.tsv

    cat missing_doi_status.tsv | rg '404$' | cut -f1 > unregistered_doi.tsv
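
Before splitting on 404 and 302 it may be worth tallying every status code in `missing_doi_status.tsv`, so that rows with some other code (or no code, if a request failed) are not silently dropped. A small Python sketch, assuming the `doi<TAB>status` layout written by the awk step above; the function name and sample rows are made up:

```python
from collections import defaultdict


def group_by_status(lines):
    """Map status code -> list of DOIs, from 'doi<TAB>status' lines."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        doi, _, status = line.partition("\t")
        # Rows with a missing status column are kept under "unknown"
        groups[status.strip() or "unknown"].append(doi)
    return dict(groups)


# Made-up rows, not real lookup results
SAMPLE_ROWS = [
    "10.1234/example-a\t302\n",
    "10.1234/example-b\t404\n",
    "10.1234/example-c\t\n",
]

for status, dois in sorted(group_by_status(SAMPLE_ROWS).items()):
    print(status, len(dois))
```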

    cat missing_doi_status.tsv | rg '302$' | cut -f1 | parallel -j1 'http --json get "https://api.crossref.org/v1/works/http://dx.doi.org/{}" mailto==webservices@archive.org' | rg '^\{' | jq .message -c | pv -l > missing_doi_crossref.json

    mkdir -p pubmed
    cat missing_pmid.tsv | parallel -j1 'http get "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={}&rettype=pubmed" > pubmed/{}.xml'
    cat missing_pmcid.tsv | parallel -j1 'http get "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id={}&rettype=pubmed" > pubmed/{}.xml'

    cat pubmed/*.xml | rg -v '^<\?xml version' | rg -v '^<!DOCTYPE' | rg -v '^<PubmedArticleSet>' | rg -v '^</PubmedArticleSet>' > pubmed_combined.xml

    # Edit file manually to add `<PubmedArticleSet>` and `</PubmedArticleSet>` wrapper.

    # in prod:
    ./fatcat_import.py pubmed --do-updates /tmp/pubmed_combined.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
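
The `rg -v` chain plus the manual wrapper edit could be scripted in one step. A Python sketch, assuming each fetched file is a complete `<PubmedArticleSet>` document with the XML declaration, DOCTYPE, and wrapper tags each starting their own line (the same assumption the `rg` patterns make); the function name and sample documents are made up:

```python
import re
import xml.etree.ElementTree as ET

# Same line prefixes the rg -v filters drop
DROP_LINE = re.compile(r"^(<\?xml version|<!DOCTYPE|</?PubmedArticleSet>)")


def combine_pubmed_xml(documents):
    """Merge several PubmedArticleSet documents into one wrapped document."""
    body = []
    for document in documents:
        for line in document.splitlines():
            if not DROP_LINE.match(line):
                body.append(line)
    return "\n".join(["<PubmedArticleSet>", *body, "</PubmedArticleSet>"]) + "\n"


# Tiny stand-ins for fetched files, only to show the shape
SAMPLE_DOCS = [
    '<?xml version="1.0" ?>\n'
    '<!DOCTYPE PubmedArticleSet PUBLIC "placeholder" "placeholder.dtd">\n'
    "<PubmedArticleSet>\n"
    "<PubmedArticle><PMID>1</PMID></PubmedArticle>\n"
    "</PubmedArticleSet>\n",
    '<?xml version="1.0" ?>\n'
    "<PubmedArticleSet>\n"
    "<PubmedArticle><PMID>2</PMID></PubmedArticle>\n"
    "</PubmedArticleSet>\n",
]

combined = combine_pubmed_xml(SAMPLE_DOCS)
root = ET.fromstring(combined)  # fails loudly if the result is not well-formed
print(root.tag, len(root))
```

Parsing the result before handing it to the importer catches a broken merge early.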