Diffstat (limited to 'notes')
-rw-r--r-- notes/missing_2020-03-20.md 282
1 files changed, 282 insertions, 0 deletions
diff --git a/notes/missing_2020-03-20.md b/notes/missing_2020-03-20.md
new file mode 100644
index 0000000..fa2705b
--- /dev/null
+++ b/notes/missing_2020-03-20.md
@@ -0,0 +1,282 @@
+
+Hand investigation of records that don't match to fatcat metadata.
+
+## Overview
+
+ 811 such records from 2020-03-20 CORD-19 dataset.
+ 221 have some DOI
+ 242 have a PMCID
+ 12 have a PMID
+
+ jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | wc -l
+ => 364
+
+ jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .source_x -r | sort | uniq -c | sort -nr
+ 288 WHO
+ 76 Elsevier
+
+ jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .title | sort | uniq -c | head
+ => 62 have empty titles, all from Elsevier
+
+ jq .title | sort | uniq -c | head
+ => from full batch, 224 have empty titles; no duplicates
+ 214 Elsevier
+ 9 PMC
+ 1 PMC_new
+
+## Lacking Identifiers, Has Title
+
+ jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
+ => 302 such papers
+
+Random sample 10:
+
+ "Surgical management strategies for orthopedic trauma patients under epidemic of novel coronavirus pneumonia"
+ Chinese Journal of Trauma, 2020
+ => Volume: 36, Issue: 2, pp 124-128
+ => https://academic.microsoft.com/paper/3009524916/related
+
+ "Molecular biology of microbial pathogenicity: Adhesion, invasion and receptors"
+ FEMS Microbiology Letters, 1984
+ => elsevier
+ => missing DOI
+ => free PDF, hard to crawl (OUP)
+ => https://fatcat.wiki/release/wbzyyyuhcfhnjfmuouygqyi7vy
+
+ "Recommendations for general surgery clinical practice in novel coronavirus pneumonia situation"
+ Chinese Journal of Surgery, 2020
+ => missing DOI and PMID
+ => DOI not in crossref
+ => free PDF download
+ => https://fatcat.wiki/release/3przclac6jfarpru2ol4xafmie
+
+ "Author index volume 20 (1983)"
+ FEMS Microbiology Letters, 1983
+ => stub
+
+ "Paying close attention to diabetic patients with novel coronavirus infection"
+ Chinese Journal of Endocrinology and Metabolism, 2020
+ => https://academic.microsoft.com/paper/3010104673/related
+ => no fatcat
+
+ "The recommendation for management of the bereavements among the family members died with novel coronavirus pneumonia"
+ Chinese Journal of Behavioral Medicine and Brain Science, 2020
+ => https://academic.microsoft.com/paper/3009517691/related
+ => 1674-6554
+ => 10.3760/cma.j.issn.1674-6554.2020.... (?)
+ => there are English and Chinese websites for the journal, but the Chinese one is more up-to-date
+ => http://med.wanfangdata.com.cn/Periodical/zgxwyxkx
+ => publisher resources:
+ http://subject.med.wanfangdata.com.cn/Channel/7
+ https://translate.googleusercontent.com/translate_c?depth=1&rurl=translate.google.com&sl=auto&source=osdd&sp=nmt4&tl=en&u=http://subject.med.wanfangdata.com.cn/Channel/7&usg=ALkJrhj0J6IQM6RKNFYPKpMshQS9UcS2oQ
+
+ "Singapore claims first use of antibody test to track coronavirus infections | Science | AAAS"
+ Science Magazine, 2020
+ => https://fatcat.wiki/release/ovghasgr6bclbj2ksbnaz635ci
+ => simply missing DOI
+
+ "High resolution CT findings and clinical features of novel coronavirus pneumonia in Guangzhou"
+ Chinese Journal of Radiology, 2020
+ => https://academic.microsoft.com/paper/3010637392/related
+ => http://www.cjrjournal.org/
+ => https://fatcat.wiki/container/grktk23p5rayfncvh3bm6ylbwy (old only)
+ => not seeing this paper in any recent (2020) issues
+
+ "COVID-19 Update From China"
+ JAMA, 2020
+ => seems to be an audio recording and/or video on "JN Learning"
+ => https://edhub.ama-assn.org/jn-learning/audio-player/18234306
+ => https://edhub.ama-assn.org/jn-learning/video-player/18234510
+ => abstract matches
+
+ "Coronavirus latest: death toll passes 2,000"
+ Nature, 2020
+ => this is a commentary article which is getting continuously updated, with the title changing
+ => DOI: 10.1038/d41586-020-00154-w
+ => do have a wayback snapshot with exact title:
+ => http://web.archive.org/web/20200220185055/https://www.nature.com/articles/d41586-020-00154-w
+
+Summary:
+- some informal material mixed in (commentary/opinion/video)
+- bunch of Chinese papers, often with DOI, not indexed in western databases,
+ but present in MAG. MAG identifiers mostly not included. could get metadata from
+ wanfang data (company)?
+- some stub articles: eg, indexes
+- some exact title matches, missing identifier (pubmed_id, DOI) matches
+
+Top journals:
+
+ 46 (none)
+ 15 Chinese Journal of Laboratory Medicine
+ => fatcat container, no papers
+ 14 FEMS Microbiology Letters
+ => fatcat container, OUP, many papers
+ 11 Chinese Journal of Medical Science Research Management
+ => no fatcat
+ => 1006-1924
+ => https://portal.issn.org/resource/ISSN/1006-1924
+ => "Zhonghua yixue keyan guanli zazhi"
+ 11 Chinese Journal of Hospital Administration
+ 10 Chinese Journal of Radiology
+ 10 Chinese Journal of Emergency Medicine
+ 9 Chinese Journal of Preventive Medicine
+ 9 Chinese Journal of Experimental Ophthalmology
+ 9 Chinese Critical Care Medicine
+ 8 Chongqing Medicine
+ 8 Chinese Journal of Infectious Diseases
+ => in fatcat (twice?)
+ => "Zhonghua chuan ran bing za zhi"
+ 7 Chinese Journal of Trauma
+ => no fatcat
+ => in MAG
+ [...]
+
+## DOIs
+
+ jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c
+ => 194 such papers
+
+ jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .source_x -r | sort | uniq -c
+ 176 Elsevier
+ 5 medrxiv
+ 13 WHO
+
+ jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .title -r | sort | uniq -c | sort -nr | head
+ 22 Index
+ 7 Subject Index
+ 2 S
+ 2 C
+ 1 V
+
+Trying a sample of DOIs that didn't match:
+
+ jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | shuf -n10 | jq .doi -r
+
+ 10.1101/2020.01.31.20019935
+ => medrxiv, not sure why not in fatcat
+ 10.1016/j.diagmicrobio.2004.10.002
+ => elsevier, not in fatcat
+
+ 10.1016/B978-0-323-53045-3.00035-0
+ => no such doi
+ 10.1016/B978-0-323-37591-7.00038-0
+ => no such doi
+ 10.1016/B978-0-323-52993-8.00048-5
+ => no such doi
+ 10.1016/B978-0-323-44887-1.00022-5
+ => no such doi
+ 10.1016/B978-343721804-0.50036-3
+ => elsevier, index
+ 10.1016/B978-0-323-44887-1.00044-4
+ => no such doi
+ => https://www.us.elsevierhealth.com/kendigs-disorders-of-the-respiratory-tract-in-children-9780323448871.html
+ 10.1016/B978-0-323-04579-7.00035-6
+ => no such doi
+ 10.1016/B978-0-323-55512-8.00140-X
+ => no such doi
+ => 10.1016/B978-0-323-55512-8.00140-X
+ => DOI is listed on elsevier homepage
+ => https://www.sciencedirect.com/science/article/pii/B978032355512800140X
+
+Many of these "no such DOI" may be partially registered?
+
+Top journals (actually books?) with missing DOIs:
+
+ bnewbold@orithena$ cat missing.json | jq .who_paper -c | jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .journal | sort | uniq -c | sort -nr | head
+ 28 "Mandell, Douglas, and Bennett's Principles and Practice of Infectious Diseases"
+ 13 "Principles and Practice of Pediatric Infectious Diseases"
+ 11 "Infectious Diseases"
+ 9 "Hunter's Tropical Medicine and Emerging Infectious Diseases"
+ 7 "Kendig's Disorders of the Respiratory Tract in Children"
+ 5 ""
+ 4 "Vaccine"
+ 4 "The Dictionary of Cell & Molecular Biology"
+ 4 "Clinical Immunology"
+ 3 "Zakim and Boyer's Hepatology"
+ [...]
+
+## PMID/PMCID
+
+ jq 'select(.doi == "" and .pubmed_id !="")' -c
+ => 8
+
+ jq 'select(.doi == "" and .pmcid !="")' -c
+ => 223
+
+All of these PMIDs and PMCIDs seem to be valid. From a quick scan, they seem to
+not be in fatcat because there are already works there with DOIs.
+
+## Recommendations
+
+In your data munging, filter out:
+
+- works with blank titles and no external identifier (eg, no DOI, PMCID, PMID, MAG ID)
+- works with blocklist titles (see below). I assume these got included due to
+ fulltext search matches, but I think they are just noise
+- works with titles which are a single capital letter (eg, "S", "C")
+
+Title blocklist (preceded by count); usually I do these by lower-casing and
+stripping non-alphanumeric characters before comparing:
+
+ 348 Index
+ 83 Subject Index
+ 76 Subject index
+ 69 Author index
+ 68 Contents
+ 67 Articles of Significant Interest Selected from This Issue by the Editors
+ 66 Information for Authors
+ 36 Graphical contents list
+ 29 Table of Contents
+ 21 In brief
+ 20 Preface
+ 20 Editorial Board
+ 19 Author Index
+ 18 Volume Contents
+ 18 Research brief
+ 18 Abstracts
+ 15 Opportunities from the Center for Perioperative Education
+ 13 Keyword index
+ 12 PNAS Plus Significance Statements
+ 11 In This Issue
+ 10 Current Awareness on Comparative and Functional Genomics
+ 9 Introduction
+ 9 Highlights of this issue
+ 9 Contents list
+ 9 Abbreviations
+ 8 QUIZ CORNER
+ 8 Positions available
+ 8 Journal Watch
+ 8 Index of Authors
+ 8 Editorial
+ 8 Cumulative Index
+ 7 Table of contents
+ 7 Quiz Corner
+ 7 Index of Subjects
+ 7 INDEX
+ 7 Foreword
+ 7 Bibliography of the current world literature
+ 6 Viral gastroenteritis
+ 6 Public Health Watch
+ 6 Contributors
+ 6 Contents of Volume
+ 6 Contents of other veterinary journals from Elsevier
+ Highlights of this issue
+ QUIZ CORNER
+ Answers to Quiz Corner
+
+## Notes
+
+There is a separate Chinese journal registry number (the "CN" number), eg "ISSN 1005-1201 CN 11-2149/R"
+
+Interesting sites to crawl or translate:
+ http://medjournals.cn/index.do
+ => 1 million papers
+ http://rs.yiigle.com/yufabiao/1181337.htm
+ => paper repository/host?
+ http://subject.med.wanfangdata.com.cn/Channel/7?mark=34
+ => list of papers? wanfang data seems like a large publisher
+ http://www.wanfangdata.com/about/about.asp
+ => mainland china (beijing)
+ => commercial/national holder of 40+ million papers
+ => indexed by EBSCO
+
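
The filtering rules from the Recommendations section of the note above could be sketched roughly as follows. This is a minimal illustration, not the code actually used for the investigation. The field names `doi`, `pubmed_id`, `pmcid` and `title` are taken from the jq filters in the note. The helper names `normalize_title` and `is_noise` are hypothetical, `RAW_BLOCKLIST` holds only a handful of entries from the full list, and a MAG ID check is left out because the note does not give a field name for it.

```python
import re

# Identifier fields as used in the jq filters in the note. A MAG ID field
# would belong here too, but its name is not given, so it is omitted.
ID_FIELDS = ("doi", "pubmed_id", "pmcid")

# A few entries from the title blocklist (not the full list).
RAW_BLOCKLIST = [
    "Index",
    "Subject Index",
    "Author index",
    "Contents",
    "Table of Contents",
    "QUIZ CORNER",
    "Information for Authors",
]


def normalize_title(title):
    """Lower-case, then drop everything that is not an ASCII letter or digit."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


# Compare on the normalized form, so "INDEX" and "Index" collapse to one key.
BLOCKLIST = {normalize_title(t) for t in RAW_BLOCKLIST}


def is_noise(record):
    """Return True if the record matches one of the three filter rules."""
    title = (record.get("title") or "").strip()
    has_id = any(record.get(field) for field in ID_FIELDS)
    if not title:
        # Blank title: drop only when there is no external identifier either.
        return not has_id
    if re.fullmatch(r"[A-Z]", title):
        # Title is a single capital letter, eg "S" or "C".
        return True
    return normalize_title(title) in BLOCKLIST
```

With this sketch, `is_noise({"title": "Table of contents", "doi": "", "pubmed_id": "", "pmcid": ""})` evaluates to true, while a record with a real title is kept whether or not it has identifiers. Titles made up entirely of non-ASCII characters normalize to an empty string, which is never in the blocklist, so they pass through untouched.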