# V4
v3 has not been released yet, but `skate` has changed a lot since, so we may continue with v4.
# Unstructured
```
{
"biblio": {
"unstructured": "J. Häger, W. Krieger, T. Rüegg, and H. Walther, J. Chem. Phys. 72, 4286 (1980).JCPSA60021-9606"
},
"index": 8,
"key": "_r4",
"ref_source": "crossref",
"release_year": 1983,
"release_ident": "tebzylkszzbyjggye5ssmebdcy",
"work_ident": "aaaofyp6uzcdnbe7hvfahylyha"
}
```
We should be able to match "J. Chem. Phys.", and maybe also "72, 4286 (1980)";
with id and title we should be able to match to:
https://fatcat.wiki/release/d2k7en7tzzdzzddwfo5xlqx4ce
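A minimal sketch of how the journal, volume, page and year could be pulled out of such an `unstructured` string. The pattern and the function name `parse_journal_citation` are assumptions made for illustration; they only cover the "Abbrev. Name vol, page (year)" shape seen above and are not what `skate` does.

```python
import re

# Assumed shape: "<abbreviated journal> <volume>, <page> (<year>)".
# Journal words are ASCII letters followed by a dot, e.g. "J. Chem. Phys.".
CITATION_RE = re.compile(
    r"(?P<journal>(?:[A-Z][A-Za-z]*\.\s*)+)"
    r"(?P<volume>\d+),\s*(?P<page>\d+)\s*\((?P<year>\d{4})\)"
)


def parse_journal_citation(unstructured):
    """Return journal, volume, page and year, or None if the shape is absent."""
    m = CITATION_RE.search(unstructured)
    if m is None:
        return None
    return {
        "container_name": m.group("journal").strip(),
        "volume": m.group("volume"),
        "page": m.group("page"),
        "year": int(m.group("year")),
    }
```

The extracted container name is still an abbreviation, so it would need a further lookup before it can be matched against a container.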
If nothing else is defined and `unstructured` contains a URL, we may extract it.
```
{
"biblio": {
"unstructured": "Friedrich-Ebert-Stiftung (FES) 2008 FES in Nepal. FES http://www.fesnepal.org/about/fes_in_nepal.htm (accessed February 15, 2009)"
},
"index": 19,
"key": "CIT0020",
"ref_source": "crossref",
"release_year": 2011,
"release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
"work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```
Also, these may say: "accessed at ..."
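A rough sketch of the URL extraction, assuming a plain regular expression is good enough; the name `extract_url` is made up for this note. Trailing punctuation is stripped, since URLs in references are often followed by "." or wrapped in parentheses.

```python
import re

# Anything from the scheme up to the next whitespace, quote or angle bracket.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def extract_url(unstructured):
    """Return the first URL found in the string, or None."""
    m = URL_RE.search(unstructured)
    if m is None:
        return None
    # Drop punctuation that most likely belongs to the sentence, not the URL.
    return m.group(0).rstrip(".,;:)]")
```

A URL that legitimately ends in ")" would be truncated by this; the existing URL cleanup would still have to run afterwards.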
# URL
* url cleanup in place
# Partial Data Mapping
* how to map partial docs onto a key
# OL beyond ISBN
Example:
```
{
"biblio": {
"container_name": "The Debt: What America Owes to Blacks",
"contrib_raw_names": [
"R Robinson"
],
"unstructured": "Randall Robinson, The Debt: What America Owes to Blacks. New York: Dutton Books, 2000, pp. 219–220.",
"year": 2000
},
"index": 22,
"key": "8_CR23",
"ref_source": "crossref",
"release_year": 2009,
"release_ident": "2igycuiobvhxrcmmrzz6anufuq",
"work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```
* https://openlibrary.org/works/OL488811W/The_debt?edition=debtwhatamerica000robi
However, there is no explicit "subtitle" field, and in this case the subtitle is buried in "text":
```
{
"key": "/works/OL488811W",
"text": [
"/works/OL488811W",
"The debt",
"The Debt",
"The Debt ",
"what America owes to Blacks",
"What America Owes to Blacks",
"OL46591M",
"OL7771042M",
"OL7590904M",
"OL3382710M",
"Randall Robinson.",
"2004556979",
"99045728",
"0452282101",
"0525945245",
```
Subtitle in editions.
```
{
"biblio": {
"container_name": "BLACK AFRICA: The Economic and Cultural Basis for a Federated State",
"unstructured": "For details on African Renaissance see Cheikh Anta Diop, BLACK AFRICA: The Economic and Cultural Basis for a Federated State, New Expanded Edition. Trenton, NJ: Africa World Press, 1987.",
"year": 1987
},
"index": 28,
"key": "8_CR29",
"ref_source": "crossref",
"release_year": 2009,
"release_ident": "2igycuiobvhxrcmmrzz6anufuq",
"work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```
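One way to cope with titles that carry the subtitle after a colon is to compare several normalized variants. The following is only a sketch; `normalize`, `title_variants` and `matches_ol_title` are hypothetical helpers, and the normalization is ASCII-only.

```python
import re


def normalize(s):
    """Lowercase, turn non-alphanumerics into spaces, collapse whitespace.

    ASCII-only: accented letters are dropped, which is a known limitation.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


def title_variants(container_name):
    """Full title, plus the part before the first colon if there is one."""
    variants = {normalize(container_name)}
    if ":" in container_name:
        main, _, _ = container_name.partition(":")
        variants.add(normalize(main))
    return variants


def matches_ol_title(container_name, ol_title, ol_subtitle=""):
    """True if any ref title variant equals the OL title (with or without subtitle)."""
    candidates = {normalize(ol_title)}
    if ol_subtitle:
        candidates.add(normalize(ol_title + " " + ol_subtitle))
    return bool(title_variants(container_name) & candidates)
```

Matching on the main title alone is permissive, so a verification step (authors, year) would still be needed.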
## OL Loop
Some do not have an explicit "works" key, but still link to an edition.
* https://openlibrary.org/books/OL10000230M/Parliamentary_Debates_House_Of_Lords_2003-2004?edition=
> An edition of Parliamentary Debates, House Of Lords 2003-2004
Example edition:
```
{
"publishers": [
"Du Temps"
],
"languages": [
{
"key": "/languages/fre"
}
],
"last_modified": {
"type": "/type/datetime",
"value": "2010-04-24T18:46:01.556464"
},
"weight": "5 ounces",
"title": "Les Fleurs bleues de Raymond Queneau",
"identifiers": {
"goodreads": [
"487215"
]
},
"isbn_13": [
"9782842741013"
],
"covers": [
3140044
],
"physical_format": "Paperback",
"isbn_10": [
"2842741013"
],
"publish_date": "January 1, 2000",
"key": "/books/OL12622734M",
"authors": [
{
"key": "/authors/OL3964945A"
}
],
"latest_revision": 5,
"works": [
{
"key": "/works/OL10000008W"
}
],
"type": {
"key": "/type/edition"
},
"physical_dimensions": "8.4 x 5.7 x 0.3 inches",
"revision": 5
}
```
Example Work:
```
{
"title": "Les Fleurs bleues de Raymond Queneau",
"created": {
"type": "/type/datetime",
"value": "2009-12-11T01:57:19.964652"
},
"covers": [
3140044
],
"last_modified": {
"type": "/type/datetime",
"value": "2010-04-28T06:54:19.472104"
},
"latest_revision": 3,
"key": "/works/OL10000008W",
"authors": [
{
"type": "/type/author_role",
"author": {
"key": "/authors/OL3964945A"
}
}
],
"type": {
"key": "/type/work"
},
"revision": 3
}
```
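A small sketch of resolving an edition record to its work key. Falling back to the edition key when "works" is missing is an assumption of this note, not documented Open Library behavior; `work_key` is a made-up name.

```python
def work_key(edition):
    """Return the first linked work key, or the edition key if none is linked."""
    works = edition.get("works") or []
    if works and works[0].get("key"):
        return works[0]["key"]
    # Assumption: without a work link, treat the edition as its own group.
    return edition["key"]
```

For the example edition above this yields `/works/OL10000008W`.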
----
## Unmatched
If we exclude any id and title, we roughly have the following field combinations (with the number of refs for each):
```
container_name|contrib_raw_names|year 64064559
unstructured 61711602
container_name|contrib_raw_names|volume|year 49701699
container_name|contrib_raw_names|unstructured|volume|year 36401044
container_name|contrib_raw_names|unstructured|year 26663422
contrib_raw_names|unstructured 16731608
container_name|contrib_raw_names|doi|unstructured|year 14207167
container_name|contrib_raw_names|doi|year 13159340
```
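The combinations above look like the sorted `biblio` keys joined by "|". A sketch of how such a signature could be computed per ref; `field_signature` and `count_signatures` are illustrative names and the filtering on id and title is left out.

```python
from collections import Counter


def field_signature(ref):
    """Join the names of all non-empty biblio fields, sorted, with '|'."""
    biblio = ref.get("biblio", {})
    return "|".join(sorted(k for k, v in biblio.items() if v not in (None, "", [])))


def count_signatures(refs):
    """Count how often each field combination occurs, most frequent first."""
    return Counter(field_signature(r) for r in refs).most_common()
```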
Some examples:
```
{
"biblio": {
"container_name": "Intern. J. Comput. Math.",
"contrib_raw_names": [
"D. Levin"
],
"volume": "B3",
"year": 1973
},
"index": 19,
"key": "PhysRevB.48.6913Cc15R1",
"ref_source": "crossref",
"release_year": 1993,
"release_ident": "i6s6e64n55hh5oned32mdwrs2i",
"work_ident": "aaaeuvgitzfafczctw3bseauri"
}
```
This refers to:
* https://www.tandfonline.com/doi/abs/10.1080/00207167308803075
* published in 1972, and not 1973 (`year`) or 1993 (`release_year`)
* https://fatcat.wiki/release/3cstmufhszalvpnppwxjohnnsa
It would help to go from "container name" to "issn", e.g. here: 0020-7160
* https://fatcat.wiki/release/search?q=levin+container_id%3A%22y4k3i2fvabgarkvywismzvy23a%22+year%3A1972
```
$ grep -i "Intern.*J.*Comput.*Math.*" jabbrev.json
{"name": "COMPEL-THE INTERNATIONAL JOURNAL FOR COMPUTATION AND MATHEMATICS IN ELECTRICAL AND ELECTRONIC ENGINEERING", "abbrev": "COMPEL"}
{"name": "INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND COMPUTER SCIENCE", "abbrev": "INT J APPL MATH COMP"}
{"name": "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS", "abbrev": "INT J COMPUT MATH"}
```
Look up the name in the ISSN data:
```
$ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS" | jq .
"@graph": [
{
"@id": "http://id.loc.gov/vocabulary/countries/enk",
"label": "England"
},
{
"@id": "organization/ISSNCenter#_1",
"@type": "http://schema.org/Organization"
},
{
"@id": "resource/ISSN-L/0020-7160",
"identifiedBy": "resource/ISSN/0020-7160#ISSN-L"
},
{
"@id": "resource/ISSN/0020-7160",
"@type": [
"http://id.loc.gov/ontologies/bibframe/Instance",
"http://id.loc.gov/ontologies/bibframe/Work",
"http://schema.org/Periodical"
],
"format": "vocabularies/medium#Print",
"http://purl.org/ontology/bibo/issn": "0020-7160",
"identifiedBy": [
"resource/ISSN/0020-7160#ISSN-L",
"resource/ISSN/0020-7160#ISSN",
"resource/ISSN/0020-7160#KeyTitle"
],
```
We would need:
* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)
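The chain could be sketched as follows, with plain dictionaries standing in for jabbrev, issnlister and fatcat. The prefix matching rule, the stopword list and all function names are assumptions of this sketch; note that "Intern." is not the listed abbreviation ("INT"), so the rough match goes against the full name.

```python
import re

# Words ignored in full journal names (assumed list).
STOPWORDS = {"OF", "THE", "AND", "FOR", "IN"}


def abbrev_matches(abbrev, name):
    """True if each abbreviation token is a prefix of the corresponding name word."""
    tokens = re.findall(r"[A-Z]+", abbrev.upper())
    words = [w for w in re.findall(r"[A-Z]+", name.upper()) if w not in STOPWORDS]
    return len(tokens) == len(words) and all(
        w.startswith(t) for t, w in zip(tokens, words)
    )


def resolve_container(abbrev, names, name_to_issn, issn_to_container):
    """rough abbrev -> full name -> issn -> container id, or None."""
    for name in names:
        if abbrev_matches(abbrev, name):
            issn = name_to_issn.get(name)
            return issn_to_container.get(issn)
    return None
```

With the three jabbrev names from the grep output, "Intern. J. Comput. Math." only matches "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS", which would lead to ISSN 0020-7160 and from there to the container id used in the search URL above.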
Example, title match with OL:
```
{
"biblio": {
"container_name": "Private schooling in less economically developed countries",
"contrib_raw_names": [
"Caddell M."
],
"year": 2008
},
"index": 7,
"key": "CIT0008",
"ref_source": "crossref",
"release_year": 2011,
"release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
"work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```
A matching OL edition record:
```
{
"publishers": [
"Symposium Books"
],
"languages": [
{
"key": "/languages/eng"
}
],
"number_of_pages": 214,
"subtitle": "Asian and African Perspectives (Oxford Studies in Comparative Education)",
"weight": "12.6 ounces",
"title": "Private Schooling in Less Economically Developed Countries",
"isbn_10": [
"1873927851"
],
"type": {
"key": "/type/edition"
},
"identifiers": {
"goodreads": [
"1078335"
]
},
"isbn_13": [
"9781873927854"
],
"covers": [
3020365
],
"physical_format": "Paperback",
"key": "/books/OL12102259M",
"publish_date": "April 1, 2007",
"contributions": [
"Prachi Srivastava (Editor)",
"Geoffrey Walford (Editor)"
],
"subjects": [
"Organization & management of education",
"ASIA",
"Africa",
"Reference / General"
],
"physical_dimensions": "9.1 x 6.1 x 0.7 inches",
"works": [
{
"key": "/works/OL24081822W"
}
],
"lccn": [
"2007408632"
],
"lc_classifications": [
"LC57.5 .P75 2007"
],
```
----
# Completeness
```
{
"biblio": {
"container_name": "La vida y época de Prebisch",
"year": 2010
},
"index": 5,
"key": "key20191115064515_B6",
"ref_source": "crossref",
"release_year": 2019,
"release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
"work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```
* https://fatcat.wiki/release/oc6nhkoah5gcnjfsjpct4ij3ea/references
* https://www.iberoamericana.se/articles/10.16993/iberoamericana.467/galley/445/download/
In the PDF, we find a DOI as well, but it does not seem to have been extracted. In fact,
the ref data comes from crossref.
Grobid gets the DOI:
```
La vida y época de Prebisch. 1901-1986. Madrid: Marcial Pons
EJDosman
10.18356/40a5d411-es
```
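Where only raw text is available, a DOI could be recovered with a pattern match. A sketch under the assumption that DOIs start with "10.", a 4 to 9 digit registrant code and a slash; `extract_doi` is a made-up helper.

```python
import re

# Prefix "10.<registrant>/" followed by a suffix without whitespace.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text):
    """Return the first DOI-like string in lowercase, or None."""
    m = DOI_RE.search(text)
    if m is None:
        return None
    # Trailing punctuation usually belongs to the surrounding sentence.
    return m.group(0).rstrip(".,;)").lower()
```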
Other issues:
* year vs release_year
```
{
"biblio": {
"container_name": "The Methodology of Scientific Research Programmes",
"year": 1980
},
"index": 13,
"key": "key20191115064515_B14",
"ref_source": "crossref",
"release_year": 2019,
"release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
"work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```
## Conservative Verification
* close, but with different years, although it seems these would actually be matches
```
different year /works/OL13199655W lvtfhk63kjbthacu2aam3jgudu 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W wpp46slm6nca7b3nwdwtjbegla 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W gvzsp7pz75d6roxrrarduprmie 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W ilga2kj4nnaqdh4rmogbsgbgbe 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
different year /works/OL13199655W 5ujpef3vjzhkvmse6ovey2q2zi 1000000delinquents 1000000 delinquents 1,000,000 Delinquents
```
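The columns above look like a compact key, a spaced key and the original title. A sketch of a normalization that reproduces them for this example; `title_keys` is an illustrative name and the real key function may differ.

```python
import re


def title_keys(title):
    """Return (compact key, spaced key) for a title.

    Keeps ASCII letters, digits and spaces only; everything else is removed.
    """
    spaced = " ".join(re.sub(r"[^a-z0-9 ]+", "", title.lower()).split())
    return spaced.replace(" ", ""), spaced
```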
## Journal name augmentation
In ~160M unmatched refs (release format) we could resolve 14M container names, via `skate-resolve-journal-name`.
```
$ zstdcat date-2021-05-06.tsv.zst | skate-resolve-journal-name -B -A /magna/data/jabbrev.json | cut -f 2 | pv -l | LC_ALL=C grep -cF resolved_container_name
2021/06/01 13:02:20 found 27178 abbreviation mappings
160M 0:14:49 [ 180k/s] [ <=> ]
14090677
```
## Discrepancy
* https://fatcat.wiki/release/cgmnjwrhlvccxnxyewd4buuhzm/references
UnmatchedRefs contains entry 11:
```
{
"biblio": {
"container_name": "Med J Aust",
"contrib_raw_names": [
"Dracup K"
],
"volume": "166",
"year": 1997
},
"index": 9,
"key": "bibr11-010740830802800102",
"ref_source": "crossref",
"release_year": 2008,
"release_ident": "cgmnjwrhlvccxnxyewd4buuhzm",
"work_ident": "aabzzlohgza2pfaol7cgqlvpke"
}
```
In the frontend, we only have a DOI: https://fatcat.wiki/release/lookup?doi=10.1016/s0147-9563(97)90082-0
## More OL matching
> ran open library and fatcat fuzzy matching (via container name) on all docs
> that did not have an id-based match; found 139M link candidates, of which
> verification yielded 11M strong or exact matches, of which around 3M had some IA
> identifier (about 200K unique); looking at a few of them, it seems these were
> somewhat restricted items, e.g. "print-disabled"
Most referenced items were:
```
13010 ia:discoverygrounde00glas
9351 ia:selfefficacyexer0000band
8341 ia:basicsofqualitat0000stra
7562 ia:researchdesignqu00cres
7027 ia:basicsqualitativ00stra
6397 ia:qualitativedataa00mile
5958 ia:briefhistoryneol00harv
5779 ia:constructinggrou00char
5291 ia:reassemblingsoci00lato
4908 ia:econometricanaly0000gree_f5q0
4762 ia:powerknowledges00fouc
4733 ia:stressappraisalc00rich
4673 ia:locationculture00bhab_220
4405 ia:threeworldswelfa00espi
4330 ia:bodiesthatmatter00butl_662
4306 ia:gendertroublefem0000butl_d7d5
4249 ia:contentanalysisi00krip
4059 ia:practiceofeveryd01cert
4013 ia:intermolecularsu00isra
3925 ia:culturesorganiza00hofs
3621 ia:fractalgeometryo00beno
3564 ia:modernityatlarge00appa
3483 ia:modelselectionmu00burn_141
3454 ia:economicinstitut00will
3300 ia:economictheoryof00down_0
3289 ia:seeinglikestateh00scot_250
3204 ia:densityfunctiona00parr
3158 ia:experienceeducat00dewe_0
3127 ia:infraredramanspe00naka
3114 ia:qualitativeinqui00cres_711
3088 ia:sciencehumanbeha00bfsk
3050 ia:structuralequati0000byrn_g1v4
3050 ia:numericalrecipes0000unse_j9c5
3003 ia:conductionheatso00cars
2955 ia:fifthdisciplineasen00seng
2793 ia:homosacersoverei00agam_937
2746 ia:naturalisticinqu00linc
2737 ia:postmoderncondit00lyot_037
2673 ia:mathematicsdiffu00cran
2576 ia:principlespracti0000klin
2530 ia:hydrodynamichydr00chan
2499 ia:strategicmanagem00free
2493 ia:viscoelasticprop00ferr
2454 ia:wehaveneverbeenm00lato_404
2428 ia:foucaultreader00fouc
2386 ia:languagesymbolic0000bour_1991
2381 ia:weaponsofweakeve0000scot
2380 ia:greattransforma000pola
2370 ia:crossingqualityc00amer_984
```
## Glitch in GS?
"The Fractal Geometry of Nature" is "cited by 47428"
(https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22fractal+geometry+of+nature%22&btnG=);
on page one of the listed references, there is "Gaussian processes in machine learning"
(2003), via:
https://www.researchgate.net/profile/Olivier_Bousquet/publication/238718428_Advanced_Lectures_on_Machine_Learning_ML_Summer_Schools_2003_Canberra_Australia_February_2-14_2003_Tubingen_Germany_August_4-16_2003_Revised_Lectures/links/02e7e52c5870850311000000/Advanced-Lectures-on-Machine-Learning-ML-Summer-Schools-2003-Canberra-Australia-February-2-14-2003-Tuebingen-Germany-August-4-16-2003-Revised-Lectures.pdf#page=70
- the paper itself does not contain such a reference anywhere in the document.
## OL fuzzy different
Reasons why pairs were marked as *different*:
```
$ zstdcat -T UnmatchedOpenLibraryMatchTable/date-2021-05-06.tsv.zst | grep ^different | cut -f2 | LC_ALL=C sort -S50% | uniq -c | sort -nr
47324670 year
46016349 contribintersectionempty
582618 pagecount
460 titlefilename
25 numdiff
```
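For reference, a Python sketch of the same tally as the shell pipeline above, assuming tab-separated lines with the status in the first column and the reason in the second; `count_reasons` is a made-up name.

```python
from collections import Counter


def count_reasons(lines, status="different"):
    """Count the second column of all lines starting with the given status."""
    counts = Counter()
    for line in lines:
        if not line.startswith(status):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            counts[fields[1]] += 1
    # Most frequent reason first, like `sort -nr`.
    return counts.most_common()
```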
The `year` may refer to different editions:
* https://fatcat.wiki/release/kngofkvoo5cinj4wqerrey4tpi/references
* https://openlibrary.org/works/OL16286792W/One_hundred_and_seventeen_days?edition=onehundredsevent00firs
> 117 Days: An Account of Confinement and Interrogation under the South African
> 90-Day Detention Law.2006 | vs This edition was published in 1965 by Penguin
> Books
## Data mismatch
* FE (frontend): https://fatcat.wiki/release/niivpohpabhajdsf35x7hr4efm/references, ref [8] shows 2011

The refs contain 2017 only:
```
{
"container_name": "19 & 20: Notes for a New Social Protagonism",
"container": {
"container_type": "",
"ident": "",
"issnl": "",
"name": "",
"publisher": "",
"revision": "",
"state": "",
"wikidata_qid": ""
},
"contribs": [
{
"raw_name": "Colective Situaciones"
}
],
"ext_ids": {},
"ident": "niivpohpabhajdsf35x7hr4efm",
"release_year": "2017",
"work_id": "7eghl5lcivfmha6d4uavrrkpce",
"extra": {
"crossref": {},
"datacite": {},
"skate": {
"status": "ref",
"ref": {
"index": 7,
"key": "\nkey\n\t\t\t\t20171225032503_CIT0007"
},
"rg": {},
"resolved_container_name": ""
},
"ol": {}
}
}
{
"container_name": "A Dictionary of Marxist Thought (2nd ed.)",
"container": {
"container_type": "",
"ident": "",
"issnl": "",
"name": "",
"publisher": "",
"revision": "",
"state": "",
"wikidata_qid": ""
},
"ext_ids": {},
"ident": "niivpohpabhajdsf35x7hr4efm",
"release_year": "2017",
"title": "Price of production and the transformation problem",
"work_id": "7eghl5lcivfmha6d4uavrrkpce",
"extra": {
"crossref": {},
"datacite": {},
"skate": {
"status": "ref",
"ref": {
"index": 12,
"key": "\nkey\n\t\t\t\t20171225032503_CIT0012"
},
"rg": {},
"resolved_container_name": ""
},
"ol": {}
}
}
{
"container_name": "A Grammar of the Multitude: For an Analysis of Contemporary Forms of Life",
"container": {
"container_type": "",
"ident": "",
"issnl": "",
"name": "",
"publisher": "",
"revision": "",
"state": "",
"wikidata_qid": ""
},
"ext_ids": {},
"ident": "niivpohpabhajdsf35x7hr4efm",
"release_year": "2017",
"work_id": "7eghl5lcivfmha6d4uavrrkpce",
"extra": {
"crossref": {},
"datacite": {},
"skate": {
"status": "ref",
"ref": {
"index": 45,
"key": "\nkey\n\t\t\t\t20171225032503_CIT0044"
},
"rg": {},
"resolved_container_name": ""
},
"ol": {}
}
}
{
"container_name": "An Introduction to the Three Volumes of Karl Marx's Capital",
"container": {
"container_type": "",
"ident": "",
"issnl": "",
"name": "",
"publisher": "",
"revision": "",
"state": "",
"wikidata_qid": ""
},
"ext_ids": {},
"ident": "niivpohpabhajdsf35x7hr4efm",
"release_year": "2017",
"work_id": "7eghl5lcivfmha6d4uavrrkpce",
"extra": {
"crossref": {},
"datacite": {},
"skate": {
"status": "ref",
"ref": {
"index": 21,
"key": "\nkey\n\t\t\t\t20171225032503_CIT0020"
},
"rg": {},
"resolved_container_name": ""
},
"ol": {}
}
}
```
## Grobid misses (ISBN)
* PDF: https://web.archive.org/web/20031204233716/http://grace.wharton.upenn.edu:80/~sok/sokpapers/1999-0/indiana-transparency/flbc-transparency.pdf
* seems grobid does not recognize ISBN?
```
Electronic data interchange in logistics
MargaretAEmmelhainz
The Logistics Handbook
James F. Robeson and William C. Copacino
New York, NY
The Free Press
WordNet: An Electronic Lexical Database
Christiane Fellbaum
The MIT Press
Cambridge, MA
```
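If the raw reference text is available, ISBNs could be picked up separately from grobid. A sketch only: the pattern is an assumption, and candidates are kept only if the ISBN-10 or ISBN-13 check digit is valid; `extract_isbns` and `is_valid_isbn` are hypothetical helpers.

```python
import re

# Optional 978/979 prefix, nine digits, then a check digit (or X);
# hyphens or spaces are allowed between digits.
ISBN_RE = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn(isbn):
    """Validate the check digit of a bare ISBN-10 or ISBN-13 string."""
    if len(isbn) == 10:
        total = sum(
            (10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(isbn)
        )
        return total % 11 == 0
    if len(isbn) == 13 and isbn.isdigit():
        total = sum((1 if i % 2 == 0 else 3) * int(c) for i, c in enumerate(isbn))
        return total % 10 == 0
    return False


def extract_isbns(text):
    """Return all distinct, checksum-valid ISBNs in order of appearance."""
    found = []
    for m in ISBN_RE.finditer(text):
        isbn = re.sub(r"[- ]", "", m.group(0)).upper()
        if is_valid_isbn(isbn) and isbn not in found:
            found.append(isbn)
    return found
```

The checksum keeps most random digit runs (page ranges, phone numbers) out, at the cost of dropping ISBNs with typos.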