Hand investigation of records that don't match to fatcat metadata.
## Overview
811 such records from 2020-03-20 CORD-19 dataset.
221 have some DOI
242 have a PMCID
12 have a PMID
jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | wc -l
=> 364
jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .source_x -r | sort | uniq -c | sort -nr
288 WHO
76 Elsevier
jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .title | sort | uniq -c | head
=> 62 have empty titles, all from Elsevier
jq .title | sort | uniq -c | head
=> from full batch, 224 have empty titles; no duplicates
214 Elsevier
9 PMC
1 PMC_new
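The jq tallies above can also be reproduced in Python. This is a minimal sketch, assuming the input is one JSON record per line (as produced by `jq -c`) with the same `doi`, `pubmed_id`, `pmcid` and `source_x` fields used in the commands above; the function names are illustrative only.

```python
import json
from collections import Counter

ID_FIELDS = ("doi", "pubmed_id", "pmcid")


def lacks_identifiers(record):
    """Mirror the jq filter: every identifier field is the empty string."""
    return all(record.get(field) == "" for field in ID_FIELDS)


def count_by_source(lines):
    """Tally identifier-less records by `source_x`, most frequent first."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between JSON records
        record = json.loads(line)
        if lacks_identifiers(record):
            counts[record.get("source_x")] += 1
    return counts.most_common()
```

Passing an open file handle (or any iterable of JSON lines) to `count_by_source` returns `(source, count)` pairs, comparable to the `sort | uniq -c | sort -nr` output.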
## Lacking Identifiers, Has Title
jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
=> 302 such papers
Random sample 10:
"Surgical management strategies for orthopedic trauma patients under epidemic of novel coronavirus pneumonia"
Chinese Journal of Trauma, 2020
=> Volume: 36, Issue: 2, pp 124-128
=> https://academic.microsoft.com/paper/3009524916/related
"Molecular biology of microbial pathogenicity: Adhesion, invasion and receptors"
FEMS Microbiology Letters, 1984
=> elsevier
=> missing DOI
=> free PDF, hard to crawl (OUP)
=> https://fatcat.wiki/release/wbzyyyuhcfhnjfmuouygqyi7vy
"Recommendations for general surgery clinical practice in novel coronavirus pneumonia situation"
Chinese Journal of Surgery, 2020
=> missing DOI and PMID
=> DOI not in crossref
=> free PDF download
=> https://fatcat.wiki/release/3przclac6jfarpru2ol4xafmie
"Author index volume 20 (1983)"
FEMS Microbiology Letters, 1983
=> stub
"Paying close attention to diabetic patients with novel coronavirus infection"
Chinese Journal of Endocrinology and Metabolism, 2020
=> https://academic.microsoft.com/paper/3010104673/related
=> no fatcat
"The recommendation for management of the bereavements among the family members died with novel coronavirus pneumonia"
Chinese Journal of Behavioral Medicine and Brain Science, 2020
=> https://academic.microsoft.com/paper/3009517691/related
=> 1674-6554
=> 10.3760/cma.j.issn.1674-6554.2020.... (?)
=> there are English and Chinese websites for the journal, but the Chinese one is more up-to-date
=> http://med.wanfangdata.com.cn/Periodical/zgxwyxkx
=> publisher resources:
http://subject.med.wanfangdata.com.cn/Channel/7
https://translate.googleusercontent.com/translate_c?depth=1&rurl=translate.google.com&sl=auto&source=osdd&sp=nmt4&tl=en&u=http://subject.med.wanfangdata.com.cn/Channel/7&usg=ALkJrhj0J6IQM6RKNFYPKpMshQS9UcS2oQ
"Singapore claims first use of antibody test to track coronavirus infections | Science | AAAS"
Science Magazine, 2020
=> https://fatcat.wiki/release/ovghasgr6bclbj2ksbnaz635ci
=> simply missing DOI
"High resolution CT findings and clinical features of novel coronavirus pneumonia in Guangzhou"
Chinese Journal of Radiology, 2020
=> https://academic.microsoft.com/paper/3010637392/related
=> http://www.cjrjournal.org/
=> https://fatcat.wiki/container/grktk23p5rayfncvh3bm6ylbwy (old only)
=> not seeing this paper in any recent (2020) issues
"COVID-19 Update From China"
JAMA, 2020
=> seems to be an audio recording and/or video on "JN Learning"
=> https://edhub.ama-assn.org/jn-learning/audio-player/18234306
=> https://edhub.ama-assn.org/jn-learning/video-player/18234510
=> abstract matches
"Coronavirus latest: death toll passes 2,000"
Nature, 2020
=> this is a commentary article which is getting continuously updated, with the title changing
=> DOI: 10.1038/d41586-020-00154-w
=> do have a wayback snapshot with exact title:
=> http://web.archive.org/web/20200220185055/https://www.nature.com/articles/d41586-020-00154-w
Summary:
- some informal material mixed in (commentary/opinion/video)
- bunch of Chinese papers, often with a DOI, not indexed in western databases,
but present in MAG. MAG identifiers mostly not included. could get metadata from
Wanfang Data (company)?
- some stub articles: eg, indexes
- some exact title matches, missing identifier (pubmed_id, DOI) matches
Top journals:
46 (none)
15 Chinese Journal of Laboratory Medicine
=> fatcat container, no papers
14 FEMS Microbiology Letters
=> fatcat container, OUP, many papers
11 Chinese Journal of Medical Science Research Management
=> no fatcat
=> 1006-1924
=> https://portal.issn.org/resource/ISSN/1006-1924
=> "Zhonghua yixue keyan guanli zazhi"
11 Chinese Journal of Hospital Administration
10 Chinese Journal of Radiology
10 Chinese Journal of Emergency Medicine
9 Chinese Journal of Preventive Medicine
9 Chinese Journal of Experimental Ophthalmology
9 Chinese Critical Care Medicine
8 Chongqing Medicine
8 Chinese Journal of Infectious Diseases
=> in fatcat (twice?)
=> "Zhonghua chuan ran bing za zhi"
7 Chinese Journal of Trauma
=> no fatcat
=> in MAG
[...]
## DOIs
jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
=> 194 such papers
jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .source_x -r | sort | uniq -c
176 Elsevier
5 medrxiv
13 WHO
jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .title -r | sort | uniq -c | sort -nr | head
22 Index
7 Subject Index
2 S
2 C
1 V
Trying a sample of DOIs that didn't match:
jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | shuf -n10 | jq .doi -r
10.1101/2020.01.31.20019935
=> medrxiv, not sure why not in fatcat
10.1016/j.diagmicrobio.2004.10.002
=> elsevier, not in fatcat
10.1016/B978-0-323-53045-3.00035-0
=> no such doi
10.1016/B978-0-323-37591-7.00038-0
=> no such doi
10.1016/B978-0-323-52993-8.00048-5
=> no such doi
10.1016/B978-0-323-44887-1.00022-5
=> no such doi
10.1016/B978-343721804-0.50036-3
=> elsevier, index
10.1016/B978-0-323-44887-1.00044-4
=> no such doi
=> https://www.us.elsevierhealth.com/kendigs-disorders-of-the-respiratory-tract-in-children-9780323448871.html
10.1016/B978-0-323-04579-7.00035-6
=> no such doi
10.1016/B978-0-323-55512-8.00140-X
=> no such doi
=> 10.1016/B978-0-323-55512-8.00140-X
=> DOI is listed on elsevier homepage
=> https://www.sciencedirect.com/science/article/pii/B978032355512800140X
Many of these "no such DOI" may be partially registered?
Top journals (actually books?) with missing DOIs:
cat missing.json | jq .who_paper -c | jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .journal | sort | uniq -c | sort -nr | head
28 "Mandell, Douglas, and Bennett's Principles and Practice of Infectious Diseases"
13 "Principles and Practice of Pediatric Infectious Diseases"
11 "Infectious Diseases"
9 "Hunter's Tropical Medicine and Emerging Infectious Diseases"
7 "Kendig's Disorders of the Respiratory Tract in Children"
5 ""
4 "Vaccine"
4 "The Dictionary of Cell & Molecular Biology"
4 "Clinical Immunology"
3 "Zakim and Boyer's Hepatology"
[...]
## PMID/PMCID
jq 'select(.doi == "" and .pubmed_id !="")' -c
=> 8
jq 'select(.doi == "" and .pmcid !="")' -c
=> 223
All of these PMIDs and PMCIDs seem to be valid. From a quick scan, they seem to
be missing from fatcat because the works already exist there under their DOIs.
## Recommendations
In your data munging, filter out:
- works with blank titles and no external identifier (eg, no DOI, PMCID, PMID, MAG ID)
- works with blocklist titles (see below). I assume these got included due to
fulltext search matches, but I think are just noise
- works with titles which are a single capital letter (eg, "S", "C")
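The blank-title and single-letter rules in this list could be sketched as a small predicate. This is only an illustration under assumptions: records are taken to be dicts with `title`, `doi`, `pmcid` and `pubmed_id` keys as in the jq commands earlier, and `mag_id` is a hypothetical key standing in for whatever field carries the MAG identifier.

```python
import re

# "mag_id" is a hypothetical field name; the real MAG column may differ.
EXTERNAL_ID_FIELDS = ("doi", "pmcid", "pubmed_id", "mag_id")

# A title consisting of exactly one ASCII capital letter, eg "S" or "C".
SINGLE_CAPITAL = re.compile(r"[A-Z]")


def has_external_id(record):
    """True if any external identifier field is present and non-empty."""
    return any(record.get(field) for field in EXTERNAL_ID_FIELDS)


def should_drop(record):
    """Apply the blank-title and single-capital-letter filters."""
    title = (record.get("title") or "").strip()
    if not title and not has_external_id(record):
        return True  # blank title and nothing else to match on
    if SINGLE_CAPITAL.fullmatch(title):
        return True  # section-letter stubs from book indexes
    return False
```

Records with a blank title but at least one identifier are kept, since they can still be matched by that identifier.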
Title blocklist (preceded by count); usually I do these by lower-casing and
stripping non-alphanumeric characters before comparing:
348 Index
83 Subject Index
76 Subject index
69 Author index
68 Contents
67 Articles of Significant Interest Selected from This Issue by the Editors
66 Information for Authors
36 Graphical contents list
29 Table of Contents
21 In brief
20 Preface
20 Editorial Board
19 Author Index
18 Volume Contents
18 Research brief
18 Abstracts
15 Opportunities from the Center for Perioperative Education
13 Keyword index
12 PNAS Plus Significance Statements
11 In This Issue
10 Current Awareness on Comparative and Functional Genomics
9 Introduction
9 Highlights of this issue
9 Contents list
9 Abbreviations
8 QUIZ CORNER
8 Positions available
8 Journal Watch
8 Index of Authors
8 Editorial
8 Cumulative Index
7 Table of contents
7 Quiz Corner
7 Index of Subjects
7 INDEX
7 Foreword
7 Bibliography of the current world literature
6 Viral gastroenteritis
6 Public Health Watch
6 Contributors
6 Contents of Volume
6 Contents of other veterinary journals from Elsevier
Highlights of this issue
QUIZ CORNER
Answers to Quiz Corner
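The comparison described above (lower-case, then strip non-alphanumeric characters) might look like the following sketch. The blocklist here is a short excerpt of the titles listed above, and treating only ASCII letters and digits as alphanumeric is an assumption of this example.

```python
import re

# Anything other than ASCII lower-case letters and digits is removed.
NON_ALNUM = re.compile(r"[^a-z0-9]")


def normalize_title(title):
    """Lower-case the title, then strip non-alphanumeric characters."""
    return NON_ALNUM.sub("", title.lower())


# Excerpt only; the full list is given above.
BLOCKLIST_TITLES = [
    "Index",
    "Subject Index",
    "Author index",
    "Contents",
    "Table of Contents",
    "Information for Authors",
    "QUIZ CORNER",
    "Answers to Quiz Corner",
]
BLOCKLIST = {normalize_title(title) for title in BLOCKLIST_TITLES}


def is_blocklisted(title):
    """True if the normalized title exactly matches a blocklist entry."""
    return normalize_title(title) in BLOCKLIST
```

Normalizing both sides collapses case and punctuation variants such as "Subject Index" / "Subject index" or "QUIZ CORNER" / "Quiz Corner" into one entry. Matching is exact, so longer titles that merely contain a blocklisted phrase are kept.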
## Notes
There is a separate Chinese journal registry number (the "CN" number), eg "ISSN 1005-1201 CN 11-2149/R"
Interesting sites to crawl or translate:
http://medjournals.cn/index.do
=> 1 million papers
http://rs.yiigle.com/yufabiao/1181337.htm
=> paper repository/host?
http://subject.med.wanfangdata.com.cn/Channel/7?mark=34
=> list of papers? wanfang data seems like a large publisher
http://www.wanfangdata.com/about/about.asp
=> mainland china (beijing)
=> commercial/national holder of 40+ million papers
=> indexed by EBSCO
## Fetching Metadata
cat metadata/cord19.2020-03-27.missing.json | jq 'select(.doi != "") | .doi' -r | sort -u > missing_doi.tsv
cat metadata/cord19.2020-03-27.missing.json | jq 'select(.pubmed_id != "") | .pubmed_id' -r | sort -u > missing_pmid.tsv
cat metadata/cord19.2020-03-27.missing.json | jq 'select(.pmcid != "") | .pmcid' -r | sort -u > missing_pmcid.tsv
cat missing_doi.tsv | parallel -j4 'http --headers head "https://doi.org/{}" | head -n1 | awk "{print \"{}\t\" \$2}"' > missing_doi_status.tsv
cat missing_doi_status.tsv | rg '404$' | cut -f1 > unregistered_doi.tsv
cat missing_doi_status.tsv | rg '302$' | cut -f1 | parallel -j1 'http --json get "https://api.crossref.org/v1/works/http://dx.doi.org/{}" mailto==webservices@archive.org' | rg '^\{' | jq .message -c | pv -l > missing_doi_crossref.json
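The status split done with `rg` above can be sketched in Python as well. This assumes `missing_doi_status.tsv` holds one `DOI<TAB>status` pair per line, as written by the `awk` step, and keeps any status other than 302 or 404 in a separate bucket for manual review; the function name is illustrative.

```python
def split_by_status(lines):
    """Partition 'doi<TAB>status' lines into registered, unregistered, other.

    302 (doi.org redirects to the publisher) is treated as registered and
    404 as unregistered, matching the rg filters; everything else, including
    malformed lines, goes to `other` as (line_or_doi, status) pairs.
    """
    registered, unregistered, other = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab, in case a DOI itself contains odd characters.
        doi, sep, status = line.rpartition("\t")
        if not sep:
            other.append((line, ""))
        elif status == "302":
            registered.append(doi)
        elif status == "404":
            unregistered.append(doi)
        else:
            other.append((doi, status))
    return registered, unregistered, other
```

Keeping the `other` bucket makes it visible when doi.org returned something unexpected (eg a timeout or server error) instead of silently dropping those DOIs.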
mkdir -p pubmed
cat missing_pmid.tsv | parallel -j1 'http get "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={}&rettype=pubmed" > pubmed/{}.xml'
cat missing_pmcid.tsv | parallel -j1 'http get "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id={}&rettype=pubmed" > pubmed/{}.xml'
cat pubmed/*.xml | rg -v '^<\?xml version' | rg -v '^<!DOCTYPE' | rg -v '^<PubmedArticleSet>' | rg -v '^</PubmedArticleSet>' > pubmed_combined.xml
# Edit file manually to add `<PubmedArticleSet>` and `</PubmedArticleSet>` wrapper.
# in prod:
./fatcat_import.py pubmed --do-updates /tmp/pubmed_combined.xml /srv/fatcat/datasets/ISSN-to-ISSN-L.txt
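The manual wrapper edit above could be automated along these lines. This is a rough sketch, assuming each fetched file keeps its XML declaration, DOCTYPE and `<PubmedArticleSet>` tags on their own lines (the same assumption the `rg -v` filters make); the function names are illustrative and not part of fatcat.

```python
import xml.etree.ElementTree as ET

# Line prefixes dropped from every input file, as in the rg -v filters.
STRIP_PREFIXES = (
    "<?xml version",
    "<!DOCTYPE",
    "<PubmedArticleSet>",
    "</PubmedArticleSet>",
)


def combine_pubmed_xml(documents):
    """Merge several efetch XML strings under one PubmedArticleSet wrapper."""
    body = []
    for doc in documents:
        for line in doc.splitlines():
            if line.startswith(STRIP_PREFIXES):
                continue
            body.append(line)
    return "\n".join(["<PubmedArticleSet>", *body, "</PubmedArticleSet>"]) + "\n"


def count_articles(combined):
    """Parse the combined XML and count top-level PubmedArticle elements."""
    root = ET.fromstring(combined)
    return len(root.findall("PubmedArticle"))
```

Calling `count_articles` on the result is a cheap well-formedness check before handing the file to the importer.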