From 19aaf791c645d651da4736d081f8a75cb67832f6 Mon Sep 17 00:00:00 2001 From: Bryan Newbold Date: Tue, 24 Mar 2020 18:43:41 -0700 Subject: notes on missing papers --- notes/missing_2020-03-20.md | 282 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 282 insertions(+) create mode 100644 notes/missing_2020-03-20.md diff --git a/notes/missing_2020-03-20.md b/notes/missing_2020-03-20.md new file mode 100644 index 0000000..fa2705b --- /dev/null +++ b/notes/missing_2020-03-20.md @@ -0,0 +1,282 @@ + +Hand investigation of records that don't match to fatcat metadata. + +## Overview + + 811 such records from 2020-03-20 CORD-19 dataset. + 221 have some DOI + 242 have a PMCID + 12 have a PMID + + jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | wc -l + => 364 + + jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .source_x -r | sort | uniq -c | sort -nr + 288 WHO + 76 Elsevier + + jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq .title | sort | uniq -c | head + => 62 have empty titles, all from Elsevier + + jq .title | sort | uniq -c | head + => from full batch, 224 have empty titles; no duplicates + 214 Elsevier + 9 PMC + 1 PMC_new + +## Lacking Identifiers, Has Title + + jq 'select(.doi == "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c + => 302 such papers + +Random sample 10: + + "Surgical management strategies for orthopedic trauma patients under epidemic of novel coronavirus pneumonia" + Chinese Journal of Trauma, 2020 + => Volume: 36, Issue: 2, pp 124-128 + => https://academic.microsoft.com/paper/3009524916/related + + "Molecular biology of microbial pathogenicity: Adhesion, invasion and receptors" + FEMS Microbiology Letters, 1984 + => elsevier + => missing DOI + => free PDF, hard to crawl (OUP) + => https://fatcat.wiki/release/wbzyyyuhcfhnjfmuouygqyi7vy + + "Recommendations for general surgery clinical practice in novel coronavirus pneumonia situation" + Chinese Journal of Surgery, 2020 + => missing DOI and PMID + => DOI not in crossref + => free PDF download + => https://fatcat.wiki/release/3przclac6jfarpru2ol4xafmie + + "Author index volume 20 (1983)" + FEMS Microbiology Letters, 1983 + => stub + + "Paying close attention to diabetic patients with novel coronavirus infection" + Chinese Journal of Endocrinology and Metabolism, 2020 + => https://academic.microsoft.com/paper/3010104673/related + => no fatcat + + "The recommendation for management of the bereavements among the family members died with novel coronavirus pneumonia" + Chinese Journal of Behavioral Medicine and Brain Science, 2020 + => https://academic.microsoft.com/paper/3009517691/related + => 1674-6554 + => 10.3760/cma.j.issn.1674-6554.2020.... (?) + => there are english and chinese websites for journal, but chinese more up-to-date + => http://med.wanfangdata.com.cn/Periodical/zgxwyxkx + => publisher resources: + http://subject.med.wanfangdata.com.cn/Channel/7 + https://translate.googleusercontent.com/translate_c?depth=1&rurl=translate.google.com&sl=auto&source=osdd&sp=nmt4&tl=en&u=http://subject.med.wanfangdata.com.cn/Channel/7&usg=ALkJrhj0J6IQM6RKNFYPKpMshQS9UcS2oQ + + "Singapore claims first use of antibody test to track coronavirus infections | Science | AAAS" + Science Magazine, 2020 + => https://fatcat.wiki/release/ovghasgr6bclbj2ksbnaz635ci + => simply missing DOI + + "High resolution CT findings and clinical features of novel coronavirus pneumonia in Guangzhou" + Chinese Journal of Radiology, 2020 + => https://academic.microsoft.com/paper/3010637392/related + => http://www.cjrjournal.org/ + => https://fatcat.wiki/container/grktk23p5rayfncvh3bm6ylbwy (old only) + => not seeing this paper in any recent (2020) issues + + "COVID-19 Update From China" + JAMA, 2020 + => seems to be an audio recording and/or video on "JN Learning" + => https://edhub.ama-assn.org/jn-learning/audio-player/18234306 + => https://edhub.ama-assn.org/jn-learning/video-player/18234510 + => abstract matches + + "Coronavirus latest: death toll passes 2,000" + Nature, 2020 + => this is a commentary article which is getting continuously updated, with the title changing + => DOI: 10.1038/d41586-020-00154-w + => do have a wayback snapshot with exact title: + => http://web.archive.org/web/20200220185055/https://www.nature.com/articles/d41586-020-00154-w + +Summary: +- some informal material mixed in (commentary/opinion/video) +- bunch of chinese papers, often with DOI, not indexed in western databases, + but are in MAG. MAG identifiers mostly not included. could get metadata from + wanfang data (company)? +- some stub articles: eg, indexes +- some exact title matches, missing identifier (pubmed_id, DOI) matches + +Top journals: + + 46 (none) + 15 Chinese Journal of Laboratory Medicine + => fatcat container, no papers + 14 FEMS Microbiology Letters + => fatcat container, OUP, many papers + 11 Chinese Journal of Medical Science Research Management + => no fatcat + => 1006-1924 + => https://portal.issn.org/resource/ISSN/1006-1924 + => "Zhonghua yixue keyan guanli zazhi" + 11 Chinese Journal of Hospital Administration + 10 Chinese Journal of Radiology + 10 Chinese Journal of Emergency Medicine + 9 Chinese Journal of Preventive Medicine + 9 Chinese Journal of Experimental Ophthalmology + 9 Chinese Critical Care Medicine + 8 Chongqing Medicine + 8 Chinese Journal of Infectious Diseases + => in fatcat (twice?) + => "Zhonghua chuan ran bing za zhi" + 7 Chinese Journal of Trauma + => no fatcat + => in MAG + [...] + +## DOIs + + jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c + => 194 such papers + + jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .source_x -r | sort | uniq -c + 176 Elsevier + 5 medrxiv + 13 WHO + + jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .title -r | sort | uniq -c | sort -nr | head + 22 Index + 7 Subject Index + 2 S + 2 C + 1 V + +Trying a sample of DOIs that didn't match: + + jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | shuf -n10 | jq .doi -r + + 10.1101/2020.01.31.20019935 + => medrxiv, not sure why not in fatcat + 10.1016/j.diagmicrobio.2004.10.002 + => elsevier, not in fatcat + + 10.1016/B978-0-323-53045-3.00035-0 + => no such doi + 10.1016/B978-0-323-37591-7.00038-0 + => no such doi + 10.1016/B978-0-323-52993-8.00048-5 + => no such doi + 10.1016/B978-0-323-44887-1.00022-5 + => no such doi + 10.1016/B978-343721804-0.50036-3 + => elsevier, index + 10.1016/B978-0-323-44887-1.00044-4 + => no such doi + => https://www.us.elsevierhealth.com/kendigs-disorders-of-the-respiratory-tract-in-children-9780323448871.html + 10.1016/B978-0-323-04579-7.00035-6 + => no such doi + 10.1016/B978-0-323-55512-8.00140-X + => no such doi + => 10.1016/B978-0-323-55512-8.00140-X + => DOI is listed on elsevier homepage + => https://www.sciencedirect.com/science/article/pii/B978032355512800140X + +Many of these "no such DOI" may be partially registered? + +Top journals (actually books?) with missing DOIs: + + bnewbold@orithena$ cat missing.json | jq .who_paper -c | jq 'select(.doi != "" and .pubmed_id =="" and .pmcid == "")' -c | jq 'select(.title != "")' -c | jq .journal | sort | uniq -c | sort -nr | head + 28 "Mandell, Douglas, and Bennett's Principles and Practice of Infectious Diseases" + 13 "Principles and Practice of Pediatric Infectious Diseases" + 11 "Infectious Diseases" + 9 "Hunter's Tropical Medicine and Emerging Infectious Diseases" + 7 "Kendig's Disorders of the Respiratory Tract in Children" + 5 "" + 4 "Vaccine" + 4 "The Dictionary of Cell & Molecular Biology" + 4 "Clinical Immunology" + 3 "Zakim and Boyer's Hepatology" + [...] + +## PMID/PMCID + + jq 'select(.doi == "" and .pubmed_id !="")' -c + => 8 + + jq 'select(.doi == "" and .pmcid !="")' -c + => 223 + +All of these PMIDs and PMCIDs seem to be valid. From a quick scan, they seem to +not be in fatcat because there are already works there with DOIs. + +## Recommendations + +In your data munging, filter out: + +- works with blank titles and no external identifier (eg, no DOI, PMCID, PMID, MAG ID) +- works with blocklist titles (see below). I assume these got included due to + fulltext search matches, but I think are just noise +- works with titles which are a single capital letter (eg, "S", "C") + +Title blocklist (preceeded by count); usually I do these by lower-casing and +striping non-alphanumeric characters before comparing: + + 348 Index + 83 Subject Index + 76 Subject index + 69 Author index + 68 Contents + 67 Articles of Significant Interest Selected from This Issue by the Editors + 66 Information for Authors + 36 Graphical contents list + 29 Table of Contents + 21 In brief + 20 Preface + 20 Editorial Board + 19 Author Index + 18 Volume Contents + 18 Research brief + 18 Abstracts + 15 Opportunities from the Center for Perioperative Education + 13 Keyword index + 12 PNAS Plus Significance Statements + 11 In This Issue + 10 Current Awareness on Comparative and Functional Genomics + 9 Introduction + 9 Highlights of this issue + 9 Contents list + 9 Abbreviations + 8 QUIZ CORNER + 8 Positions available + 8 Journal Watch + 8 Index of Authors + 8 Editorial + 8 Cumulative Index + 7 Table of contents + 7 Quiz Corner + 7 Index of Subjects + 7 INDEX + 7 Foreword + 7 Bibliography of the current world literature + 6 Viral gastroenteritis + 6 Public Health Watch + 6 Contributors + 6 Contents of Volume + 6 Contents of other veterinary journals from Elsevier + Highlights of this issue + QUIZ CORNER + Answers to Quiz Corner + +## Notes + +There is some chinese journal registry number, eg "ISSN 1005-1201 CN 11-2149/R" + +Interesting sites to crawl or translate: + http://medjournals.cn/index.do + => 1 million papers + http://rs.yiigle.com/yufabiao/1181337.htm + => paper repository/host? + http://subject.med.wanfangdata.com.cn/Channel/7?mark=34 + => list of papers? wanfang data seems like a large publisher + http://www.wanfangdata.com/about/about.asp + => mainland china (beijing) + => commerical/national holder of 40+ million papers + => indexed by EBSCO + -- cgit v1.2.3