# V4

We have not released v3, but there have been many changes to `skate`, so we may continue with v4.

# Unstructured

```
{
  "biblio": {
    "unstructured": "J. Häger, W. Krieger, T. Rüegg, and H. Walther, J. Chem. Phys. 72, 4286 (1980).JCPSA60021-9606"
  },
  "index": 8,
  "key": "_r4",
  "ref_source": "crossref",
  "release_year": 1983,
  "release_ident": "tebzylkszzbyjggye5ssmebdcy",
  "work_ident": "aaaofyp6uzcdnbe7hvfahylyha"
}
```

We should be able to match "J. Chem. Phys.", and maybe also "72, 4286 (1980)";
with id and title we should be able to match to:
https://fatcat.wiki/release/d2k7en7tzzdzzddwfo5xlqx4ce

If nothing else is defined and `unstructured` contains a URL, we may extract that.

```
{
  "biblio": {
    "unstructured": "Friedrich-Ebert-Stiftung (FES) 2008 FES in Nepal. FES http://www.fesnepal.org/about/fes_in_nepal.htm (accessed February 15, 2009)"
  },
  "index": 19,
  "key": "CIT0020",
  "ref_source": "crossref",
  "release_year": 2011,
  "release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
  "work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```

Also, these may say: "accessed at ..."

# URL

* URL cleanup in place

# Partial Data Mapping

* how to map partial docs onto a key

# OL beyond ISBN

Example:

```
{
  "biblio": {
    "container_name": "The Debt: What America Owes to Blacks",
    "contrib_raw_names": [
      "R Robinson"
    ],
    "unstructured": "Randall Robinson, The Debt: What America Owes to Blacks. New York: Dutton Books, 2000, pp. 219–220.",
    "year": 2000
  },
  "index": 22,
  "key": "8_CR23",
  "ref_source": "crossref",
  "release_year": 2009,
  "release_ident": "2igycuiobvhxrcmmrzz6anufuq",
  "work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```

* https://openlibrary.org/works/OL488811W/The_debt?edition=debtwhatamerica000robi

However, there is no explicit "subtitle" field, and in this case the subtitle
is buried in "text":

```
{
  "key": "/works/OL488811W",
  "text": [
    "/works/OL488811W",
    "The debt",
    "The Debt",
    "The Debt ",
    "what America owes to Blacks",
    "What America Owes to Blacks",
    "OL46591M",
    "OL7771042M",
    "OL7590904M",
    "OL3382710M",
    "Randall Robinson.",
    "2004556979",
    "99045728",
    "0452282101",
    "0525945245",
```

Subtitle in editions.

```
{
  "biblio": {
    "container_name": "BLACK AFRICA: The Economic and Cultural Basis for a Federated State",
    "unstructured": "For details on African Renaissance see Cheikh Anta Diop, BLACK AFRICA: The Economic and Cultural Basis for a Federated State, New Expanded Edition. Trenton, NJ: Africa World Press, 1987.",
    "year": 1987
  },
  "index": 28,
  "key": "8_CR29",
  "ref_source": "crossref",
  "release_year": 2009,
  "release_ident": "2igycuiobvhxrcmmrzz6anufuq",
  "work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```

## OL Loop

Some do not have an explicit "works" key, but still link to an edition.

* https://openlibrary.org/books/OL10000230M/Parliamentary_Debates_House_Of_Lords_2003-2004?edition=

> An edition of Parliamentary Debates, House Of Lords 2003-2004

Example edition:

```
{
  "publishers": [
    "Du Temps"
  ],
  "languages": [
    {
      "key": "/languages/fre"
    }
  ],
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-24T18:46:01.556464"
  },
  "weight": "5 ounces",
  "title": "Les Fleurs bleues de Raymond Queneau",
  "identifiers": {
    "goodreads": [
      "487215"
    ]
  },
  "isbn_13": [
    "9782842741013"
  ],
  "covers": [
    3140044
  ],
  "physical_format": "Paperback",
  "isbn_10": [
    "2842741013"
  ],
  "publish_date": "January 1, 2000",
  "key": "/books/OL12622734M",
  "authors": [
    {
      "key": "/authors/OL3964945A"
    }
  ],
  "latest_revision": 5,
  "works": [
    {
      "key": "/works/OL10000008W"
    }
  ],
  "type": {
    "key": "/type/edition"
  },
  "physical_dimensions": "8.4 x 5.7 x 0.3 inches",
  "revision": 5
}
```

Example work:

```
{
  "title": "Les Fleurs bleues de Raymond Queneau",
  "created": {
    "type": "/type/datetime",
    "value": "2009-12-11T01:57:19.964652"
  },
  "covers": [
    3140044
  ],
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-28T06:54:19.472104"
  },
  "latest_revision": 3,
  "key": "/works/OL10000008W",
  "authors": [
    {
      "type": "/type/author_role",
      "author": {
        "key": "/authors/OL3964945A"
      }
    }
  ],
  "type": {
    "key": "/type/work"
  },
  "revision": 3
}
```

----

## Unmatched

If we exclude any id and title, we roughly have the following field
combinations (with counts):

```
container_name|contrib_raw_names|year 64064559
unstructured 61711602
container_name|contrib_raw_names|volume|year 49701699
container_name|contrib_raw_names|unstructured|volume|year 36401044
container_name|contrib_raw_names|unstructured|year 26663422
contrib_raw_names|unstructured 16731608
container_name|contrib_raw_names|doi|unstructured|year 14207167
container_name|contrib_raw_names|doi|year 13159340
```

Some examples:

```
{
  "biblio": {
    "container_name": "Intern. J. Comput. Math.",
    "contrib_raw_names": [
      "D. Levin"
    ],
    "volume": "B3",
    "year": 1973
  },
  "index": 19,
  "key": "PhysRevB.48.6913Cc15R1",
  "ref_source": "crossref",
  "release_year": 1993,
  "release_ident": "i6s6e64n55hh5oned32mdwrs2i",
  "work_ident": "aaaeuvgitzfafczctw3bseauri"
}
```

This refers to:

* https://www.tandfonline.com/doi/abs/10.1080/00207167308803075
* 1972, and not 1973, 1993
* https://fatcat.wiki/release/3cstmufhszalvpnppwxjohnnsa

It would help to go from "container name" to "issn", e.g. here: 0020-7160

* https://fatcat.wiki/release/search?q=levin+container_id%3A%22y4k3i2fvabgarkvywismzvy23a%22+year%3A1972

```
$ grep -i "Intern.*J.*Comput.*Math.*" jabbrev.json
{"name": "COMPEL-THE INTERNATIONAL JOURNAL FOR COMPUTATION AND MATHEMATICS IN ELECTRICAL AND ELECTRONIC ENGINEERING", "abbrev": "COMPEL"}
{"name": "INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND COMPUTER SCIENCE", "abbrev": "INT J APPL MATH COMP"}
{"name": "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS", "abbrev": "INT J COMPUT MATH"}
```

Look up the name in the ISSN data:

```
$ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS" | jq .
  "@graph": [
    {
      "@id": "http://id.loc.gov/vocabulary/countries/enk",
      "label": "England"
    },
    {
      "@id": "organization/ISSNCenter#_1",
      "@type": "http://schema.org/Organization"
    },
    {
      "@id": "resource/ISSN-L/0020-7160",
      "identifiedBy": "resource/ISSN/0020-7160#ISSN-L"
    },
    {
      "@id": "resource/ISSN/0020-7160",
      "@type": [
        "http://id.loc.gov/ontologies/bibframe/Instance",
        "http://id.loc.gov/ontologies/bibframe/Work",
        "http://schema.org/Periodical"
      ],
      "format": "vocabularies/medium#Print",
      "http://purl.org/ontology/bibo/issn": "0020-7160",
      "identifiedBy": [
        "resource/ISSN/0020-7160#ISSN-L",
        "resource/ISSN/0020-7160#ISSN",
        "resource/ISSN/0020-7160#KeyTitle"
      ],
```

We would need:

* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)

Example, title match with OL:

```
{
  "biblio": {
    "container_name": "Private schooling in less economically developed countries",
    "contrib_raw_names": [
      "Caddell M."
    ],
    "year": 2008
  },
  "index": 7,
  "key": "CIT0008",
  "ref_source": "crossref",
  "release_year": 2011,
  "release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
  "work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```

A matching OL edition record:

```
{
  "publishers": [
    "Symposium Books"
  ],
  "languages": [
    {
      "key": "/languages/eng"
    }
  ],
  "number_of_pages": 214,
  "subtitle": "Asian and African Perspectives (Oxford Studies in Comparative Education)",
  "weight": "12.6 ounces",
  "title": "Private Schooling in Less Economically Developed Countries",
  "isbn_10": [
    "1873927851"
  ],
  "type": {
    "key": "/type/edition"
  },
  "identifiers": {
    "goodreads": [
      "1078335"
    ]
  },
  "isbn_13": [
    "9781873927854"
  ],
  "covers": [
    3020365
  ],
  "physical_format": "Paperback",
  "key": "/books/OL12102259M",
  "publish_date": "April 1, 2007",
  "contributions": [
    "Prachi Srivastava (Editor)",
    "Geoffrey Walford (Editor)"
  ],
  "subjects": [
    "Organization & management of education",
    "ASIA",
    "Africa",
    "Reference / General"
  ],
  "physical_dimensions": "9.1 x 6.1 x 0.7 inches",
  "works": [
    {
      "key": "/works/OL24081822W"
    }
  ],
  "lccn": [
    "2007408632"
  ],
  "lc_classifications": [
    "LC57.5 .P75 2007"
  ],
```

----

# Completeness

```
{
  "biblio": {
    "container_name": "La vida y época de Prebisch",
    "year": 2010
  },
  "index": 5,
  "key": "key20191115064515_B6",
  "ref_source": "crossref",
  "release_year": 2019,
  "release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
  "work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```

* https://fatcat.wiki/release/oc6nhkoah5gcnjfsjpct4ij3ea/references
* https://www.iberoamericana.se/articles/10.16993/iberoamericana.467/galley/445/download/

In the PDF we find a DOI as well, but it does not seem to be extracted. In
fact, the ref data comes from Crossref. Grobid gets the DOI:

```
La vida y época de Prebisch. 1901-1986. Madrid: Marcial Pons EJDosman 10.18356/40a5d411-es
```

Other issues:

* year vs release_year

```
{
  "biblio": {
    "container_name": "The Methodology of Scientific Research Programmes",
    "year": 1980
  },
  "index": 13,
  "key": "key20191115064515_B14",
  "ref_source": "crossref",
  "release_year": 2019,
  "release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
  "work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```

## Conservative Verification

* close, but different years, although it seems it would actually be a match

```
different year /works/OL13199655W lvtfhk63kjbthacu2aam3jgudu 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W wpp46slm6nca7b3nwdwtjbegla 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W gvzsp7pz75d6roxrrarduprmie 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W ilga2kj4nnaqdh4rmogbsgbgbe 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W 5ujpef3vjzhkvmse6ovey2q2zi 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
```

## Journal name augmentation

Among ~160M unmatched refs (release format) we could resolve 14M container
names via `skate-resolve-journal-name`.

```
$ zstdcat date-2021-05-06.tsv.zst | skate-resolve-journal-name -B -A /magna/data/jabbrev.json | cut -f 2 | pv -l | LC_ALL=C grep -cF resolved_container_name
2021/06/01 13:02:20 found 27178 abbreviation mappings
 160M 0:14:49 [ 180k/s] [ <=> ]
14090677
```

## Discrepancy

* https://fatcat.wiki/release/cgmnjwrhlvccxnxyewd4buuhzm/references

UnmatchedRefs contains entry 11:

```
{
  "biblio": {
    "container_name": "Med J Aust",
    "contrib_raw_names": [
      "Dracup K"
    ],
    "volume": "166",
    "year": 1997
  },
  "index": 9,
  "key": "bibr11-010740830802800102",
  "ref_source": "crossref",
  "release_year": 2008,
  "release_ident": "cgmnjwrhlvccxnxyewd4buuhzm",
  "work_ident": "aabzzlohgza2pfaol7cgqlvpke"
}
```

In the frontend, we only have a DOI: https://fatcat.wiki/release/lookup?doi=10.1016/s0147-9563(97)90082-0

## More OL matching

> ran open library and fatcat fuzzy matching (via container name) on all docs,
> that did not have an id-based match; found 139M link candidates, of which
> verification found 11M strong or exact matches, of which around 3M had some IA
> identifier (about 200K unique; but looking at a few of them, it seems these were somewhat
> restricted items, e.g. "print-disabled")

Most referenced items were:

```
13010 ia:discoverygrounde00glas
 9351 ia:selfefficacyexer0000band
 8341 ia:basicsofqualitat0000stra
 7562 ia:researchdesignqu00cres
 7027 ia:basicsqualitativ00stra
 6397 ia:qualitativedataa00mile
 5958 ia:briefhistoryneol00harv
 5779 ia:constructinggrou00char
 5291 ia:reassemblingsoci00lato
 4908 ia:econometricanaly0000gree_f5q0
 4762 ia:powerknowledges00fouc
 4733 ia:stressappraisalc00rich
 4673 ia:locationculture00bhab_220
 4405 ia:threeworldswelfa00espi
 4330 ia:bodiesthatmatter00butl_662
 4306 ia:gendertroublefem0000butl_d7d5
 4249 ia:contentanalysisi00krip
 4059 ia:practiceofeveryd01cert
 4013 ia:intermolecularsu00isra
 3925 ia:culturesorganiza00hofs
 3621 ia:fractalgeometryo00beno
 3564 ia:modernityatlarge00appa
 3483 ia:modelselectionmu00burn_141
 3454 ia:economicinstitut00will
 3300 ia:economictheoryof00down_0
 3289 ia:seeinglikestateh00scot_250
 3204 ia:densityfunctiona00parr
 3158 ia:experienceeducat00dewe_0
 3127 ia:infraredramanspe00naka
 3114 ia:qualitativeinqui00cres_711
 3088 ia:sciencehumanbeha00bfsk
 3050 ia:structuralequati0000byrn_g1v4
 3050 ia:numericalrecipes0000unse_j9c5
 3003 ia:conductionheatso00cars
 2955 ia:fifthdisciplineasen00seng
 2793 ia:homosacersoverei00agam_937
 2746 ia:naturalisticinqu00linc
 2737 ia:postmoderncondit00lyot_037
 2673 ia:mathematicsdiffu00cran
 2576 ia:principlespracti0000klin
 2530 ia:hydrodynamichydr00chan
 2499 ia:strategicmanagem00free
 2493 ia:viscoelasticprop00ferr
 2454 ia:wehaveneverbeenm00lato_404
 2428 ia:foucaultreader00fouc
 2386 ia:languagesymbolic0000bour_1991
 2381 ia:weaponsofweakeve0000scot
 2380 ia:greattransforma000pola
 2370 ia:crossingqualityc00amer_984
```

## Glitch in GS?

The fractal geometry of nature is "cited by 47428"
(https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22fractal+geometry+of+nature%22&btnG=);
on page one of the citing documents, there is "Gaussian processes in machine
learning" (2003), via:
https://www.researchgate.net/profile/Olivier_Bousquet/publication/238718428_Advanced_Lectures_on_Machine_Learning_ML_Summer_Schools_2003_Canberra_Australia_February_2-14_2003_Tubingen_Germany_August_4-16_2003_Revised_Lectures/links/02e7e52c5870850311000000/Advanced-Lectures-on-Machine-Learning-ML-Summer-Schools-2003-Canberra-Australia-February-2-14-2003-Tuebingen-Germany-August-4-16-2003-Revised-Lectures.pdf#page=70

The paper itself does not contain such a reference anywhere in the document.

## OL fuzzy different

Reasons why pairs were marked as *different*:

```
$ zstdcat -T UnmatchedOpenLibraryMatchTable/date-2021-05-06.tsv.zst | grep ^different | cut -f2 | LC_ALL=C sort -S50% | uniq -c | sort -nr
47324670 year
46016349 contribintersectionempty
  582618 pagecount
     460 titlefilename
      25 numdiff
```

The `year` may refer to different editions:

* https://fatcat.wiki/release/kngofkvoo5cinj4wqerrey4tpi/references
* https://openlibrary.org/works/OL16286792W/One_hundred_and_seventeen_days?edition=onehundredsevent00firs

> 117 Days: An Account of Confinement and Interrogation under the South African
> 90-Day Detention Law.2006 | vs This edition was published in 1965 by Penguin
> Books

## Data mismatch

* FE: https://fatcat.wiki/release/niivpohpabhajdsf35x7hr4efm/references, [8]: 2011 refs (2017 only)

```
{
  "container_name": "19 & 20: Notes for a New Social Protagonism",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "contribs": [
    {
      "raw_name": "Colective Situaciones"
    }
  ],
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 7,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0007"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "A Dictionary of Marxist Thought (2nd ed.)",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "title": "Price of production and the transformation problem",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 12,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0012"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "A Grammar of the Multitude: For an Analysis of Contemporary Forms of Life",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 45,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0044"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "An Introduction to the Three Volumes of Karl Marx's Capital",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 21,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0020"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
```

## Grobid misses (ISBN)

* PDF: https://web.archive.org/web/20031204233716/http://grace.wharton.upenn.edu:80/~sok/sokpapers/1999-0/indiana-transparency/flbc-transparency.pdf
* it seems Grobid does not recognize ISBN?

```
Electronic data interchange in logistics MargaretAEmmelhainz The Logistics Handbook James F. Robeson and William C. Copacino
New York, NY
The Free Press
WordNet: An Electronic Lexical Database Christiane Fellbaum The MIT Press Cambridge, MA
```
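
If Grobid does not yield an ISBN, one fallback could be to scan the raw reference string for ISBN-shaped digit runs and keep only those with a valid check digit. The following is a minimal sketch under that assumption; `find_isbns`, `valid_isbn10` and `valid_isbn13` are hypothetical helpers, not part of `skate` or Grobid. The ISBNs in the usage line are taken from the OL records quoted earlier in these notes, formatted with hyphens for illustration.

```python
import re

# Candidate pattern: optional 978/979 prefix, then nine digits and a final
# digit or X, with optional hyphens or spaces between digits.
CANDIDATE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def valid_isbn10(s):
    """ISBN-10 check: weighted sum (weights 10..1) must be divisible by 11."""
    if len(s) != 10 or not s[:9].isdigit() or s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def valid_isbn13(s):
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or not s.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def find_isbns(text):
    """Return normalized ISBNs found in text, in order, without duplicates."""
    found = []
    for m in CANDIDATE.finditer(text):
        s = re.sub(r"[- ]", "", m.group(0)).upper()
        if (valid_isbn10(s) or valid_isbn13(s)) and s not in found:
            found.append(s)
    return found


print(find_isbns("ISBN 978-1-873927-85-4; also 0-452-28210-1 and 1234567890."))
```

The checksum step is what keeps page ranges, years and other digit runs in unstructured references from being reported as identifiers.
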