# V4

We have not released v3, but `skate` has changed a lot, so we may continue with v4.

# Unstructured

```
{
  "biblio": {
    "unstructured": "J. Häger, W. Krieger, T. Rüegg, and H. Walther, J. Chem. Phys. 72, 4286 (1980).JCPSA60021-9606"
  },
  "index": 8,
  "key": "_r4",
  "ref_source": "crossref",
  "release_year": 1983,
  "release_ident": "tebzylkszzbyjggye5ssmebdcy",
  "work_ident": "aaaofyp6uzcdnbe7hvfahylyha"
}
```

We should be able to match "J. Chem. Phys." and perhaps also "72, 4286 (1980)".
With id and title, we should be able to match to:
https://fatcat.wiki/release/d2k7en7tzzdzzddwfo5xlqx4ce

If nothing else is defined and `unstructured` contains a URL, we may extract that.

```
{
  "biblio": {
    "unstructured": "Friedrich-Ebert-Stiftung (FES) 2008 FES in Nepal. FES http://www.fesnepal.org/about/fes_in_nepal.htm (accessed February 15, 2009)"
  },
  "index": 19,
  "key": "CIT0020",
  "ref_source": "crossref",
  "release_year": 2011,
  "release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
  "work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```

Also, these may say: "accessed at ..."
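
A minimal sketch of such a URL extraction, assuming a simple regular expression is good enough for a first pass. The names `URL_PATTERN` and `extract_urls` are hypothetical and not part of `skate`; the trailing-punctuation cleanup is an assumption as well.

```python
import re
from typing import List

# Hypothetical pattern: an http(s) URL runs until the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(unstructured: str) -> List[str]:
    """Return URLs found in a free-text reference string.

    Trailing punctuation is stripped, since citations often end a URL
    with "." or ")". This is a heuristic and may clip URLs that
    legitimately end in one of these characters.
    """
    return [m.rstrip(".,;:)]>\"'") for m in URL_PATTERN.findall(unstructured)]


ref = (
    "Friedrich-Ebert-Stiftung (FES) 2008 FES in Nepal. FES "
    "http://www.fesnepal.org/about/fes_in_nepal.htm (accessed February 15, 2009)"
)
print(extract_urls(ref))
```

Phrases like "accessed ..." or "accessed at ..." follow the URL after a space here, so they do not end up in the match.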

# URL

* url cleanup in place

# Partial Data Mapping

* how to map partial docs onto a key

# OL beyond ISBN

Example:

```
{
  "biblio": {
    "container_name": "The Debt: What America Owes to Blacks",
    "contrib_raw_names": [
      "R Robinson"
    ],
    "unstructured": "Randall Robinson, The Debt: What America Owes to Blacks. New York: Dutton Books, 2000, pp. 219–220.",
    "year": 2000
  },
  "index": 22,
  "key": "8_CR23",
  "ref_source": "crossref",
  "release_year": 2009,
  "release_ident": "2igycuiobvhxrcmmrzz6anufuq",
  "work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```

* https://openlibrary.org/works/OL488811W/The_debt?edition=debtwhatamerica000robi

However, there is no explicit "subtitle" field, and in this case, the subtitle is buried in "text":

```
{
  "key": "/works/OL488811W",
  "text": [
    "/works/OL488811W",
    "The debt",
    "The Debt",
    "The Debt ",
    "what America owes to Blacks",
    "What America Owes to Blacks",
    "OL46591M",
    "OL7771042M",
    "OL7590904M",
    "OL3382710M",
    "Randall Robinson.",
    "2004556979",
    "99045728",
    "0452282101",
    "0525945245",
```

Subtitle in editions.
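
One way to use the "text" list for matching could look like the sketch below: split the reference `container_name` at the first colon and check whether both parts occur among the normalized "text" values. The helper names (`normalize`, `matches_work_text`) are hypothetical, and splitting at the colon is an assumption that will not hold for every title.

```python
import re
from typing import Iterable


def normalize(s: str) -> str:
    """Collapse whitespace, strip surrounding punctuation, lowercase."""
    return re.sub(r"\s+", " ", s).strip(" .,:;").lower()


def matches_work_text(container_name: str, ol_text: Iterable[str]) -> bool:
    """True if title (and subtitle, if any) appear in the OL "text" values."""
    title, _, subtitle = container_name.partition(":")
    values = {normalize(t) for t in ol_text}
    title, subtitle = normalize(title), normalize(subtitle)
    # A missing subtitle should not block a match on the title alone.
    return title in values and (not subtitle or subtitle in values)


ol_text = [
    "/works/OL488811W",
    "The debt",
    "The Debt ",
    "what America owes to Blacks",
    "Randall Robinson.",
]
print(matches_work_text("The Debt: What America Owes to Blacks", ol_text))
```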

```
{
  "biblio": {
    "container_name": "BLACK AFRICA: The Economic and Cultural Basis for a Federated State",
    "unstructured": "For details on African Renaissance see Cheikh Anta Diop, BLACK AFRICA: The Economic and Cultural Basis for a Federated State, New Expanded Edition. Trenton, NJ: Africa World Press, 1987.",
    "year": 1987
  },
  "index": 28,
  "key": "8_CR29",
  "ref_source": "crossref",
  "release_year": 2009,
  "release_ident": "2igycuiobvhxrcmmrzz6anufuq",
  "work_ident": "aaacj23jqbdxvajwj5kc6jpejq"
}
```

## OL Loop

Some do not have an explicit "works" key, but still link to an edition.

* https://openlibrary.org/books/OL10000230M/Parliamentary_Debates_House_Of_Lords_2003-2004?edition=

> An edition of Parliamentary Debates, House Of Lords 2003-2004

Example edition:

```
{
  "publishers": [
    "Du Temps"
  ],
  "languages": [
    {
      "key": "/languages/fre"
    }
  ],
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-24T18:46:01.556464"
  },
  "weight": "5 ounces",
  "title": "Les Fleurs bleues de Raymond Queneau",
  "identifiers": {
    "goodreads": [
      "487215"
    ]
  },
  "isbn_13": [
    "9782842741013"
  ],
  "covers": [
    3140044
  ],
  "physical_format": "Paperback",
  "isbn_10": [
    "2842741013"
  ],
  "publish_date": "January 1, 2000",
  "key": "/books/OL12622734M",
  "authors": [
    {
      "key": "/authors/OL3964945A"
    }
  ],
  "latest_revision": 5,
  "works": [
    {
      "key": "/works/OL10000008W"
    }
  ],
  "type": {
    "key": "/type/edition"
  },
  "physical_dimensions": "8.4 x 5.7 x 0.3 inches",
  "revision": 5
}
```

Example Work:

```
{
  "title": "Les Fleurs bleues de Raymond Queneau",
  "created": {
    "type": "/type/datetime",
    "value": "2009-12-11T01:57:19.964652"
  },
  "covers": [
    3140044
  ],
  "last_modified": {
    "type": "/type/datetime",
    "value": "2010-04-28T06:54:19.472104"
  },
  "latest_revision": 3,
  "key": "/works/OL10000008W",
  "authors": [
    {
      "type": "/type/author_role",
      "author": {
        "key": "/authors/OL3964945A"
      }
    }
  ],
  "type": {
    "key": "/type/work"
  },
  "revision": 3
}
```

----

## Unmatched

If we exclude any id and title, we'll roughly have the following fields:

```
container_name|contrib_raw_names|year                                        64064559
unstructured                                                                 61711602
container_name|contrib_raw_names|volume|year                                 49701699
container_name|contrib_raw_names|unstructured|volume|year                    36401044
container_name|contrib_raw_names|unstructured|year                           26663422
contrib_raw_names|unstructured                                               16731608
container_name|contrib_raw_names|doi|unstructured|year                       14207167
container_name|contrib_raw_names|doi|year                                    13159340
```
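
A histogram like the one above can be derived by joining the names of the populated `biblio` fields into a signature and counting signatures. The following is only a sketch: `field_signature` is a hypothetical helper, and how the records were filtered beforehand is not reproduced here.

```python
import json
from collections import Counter


def field_signature(ref: dict) -> str:
    """Join the names of all non-empty biblio fields, sorted, with '|'."""
    biblio = ref.get("biblio", {})
    present = sorted(k for k, v in biblio.items() if v not in (None, "", [], {}))
    return "|".join(present)


# Two abbreviated records in the ref format used throughout these notes.
lines = [
    '{"biblio": {"container_name": "Intern. J. Comput. Math.", '
    '"contrib_raw_names": ["D. Levin"], "volume": "B3", "year": 1973}}',
    '{"biblio": {"container_name": "La vida y época de Prebisch", "year": 2010}}',
]

counts = Counter(field_signature(json.loads(line)) for line in lines)
for signature, n in counts.most_common():
    print(f"{signature:<60}{n:>10}")
```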

Some examples:

```
{
  "biblio": {
    "container_name": "Intern. J. Comput. Math.",
    "contrib_raw_names": [
      "D. Levin"
    ],
    "volume": "B3",
    "year": 1973
  },
  "index": 19,
  "key": "PhysRevB.48.6913Cc15R1",
  "ref_source": "crossref",
  "release_year": 1993,
  "release_ident": "i6s6e64n55hh5oned32mdwrs2i",
  "work_ident": "aaaeuvgitzfafczctw3bseauri"
}
```

This refers to:

* https://www.tandfonline.com/doi/abs/10.1080/00207167308803075
* 1972, and not 1973, 1993
* https://fatcat.wiki/release/3cstmufhszalvpnppwxjohnnsa

It would help to go from "container name" to "issn", e.g. here: 0020-7160

* https://fatcat.wiki/release/search?q=levin+container_id%3A%22y4k3i2fvabgarkvywismzvy23a%22+year%3A1972

```
$ grep -i "Intern.*J.*Comput.*Math.*" jabbrev.json
{"name": "COMPEL-THE INTERNATIONAL JOURNAL FOR COMPUTATION AND MATHEMATICS IN ELECTRICAL AND ELECTRONIC ENGINEERING", "abbrev": "COMPEL"}
{"name": "INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND COMPUTER SCIENCE", "abbrev": "INT J APPL MATH COMP"}
{"name": "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS", "abbrev": "INT J COMPUT MATH"}
```

Lookup name in issn:


```
$ zstdcat tmp/data.ndj.zst | grep -i "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS" | jq .

  "@graph": [
    {
      "@id": "http://id.loc.gov/vocabulary/countries/enk",
      "label": "England"
    },
    {
      "@id": "organization/ISSNCenter#_1",
      "@type": "http://schema.org/Organization"
    },
    {
      "@id": "resource/ISSN-L/0020-7160",
      "identifiedBy": "resource/ISSN/0020-7160#ISSN-L"
    },
    {
      "@id": "resource/ISSN/0020-7160",
      "@type": [
        "http://id.loc.gov/ontologies/bibframe/Instance",
        "http://id.loc.gov/ontologies/bibframe/Work",
        "http://schema.org/Periodical"
      ],
      "format": "vocabularies/medium#Print",
      "http://purl.org/ontology/bibo/issn": "0020-7160",
      "identifiedBy": [
        "resource/ISSN/0020-7160#ISSN-L",
        "resource/ISSN/0020-7160#ISSN",
        "resource/ISSN/0020-7160#KeyTitle"
      ],
```

We would need:

* rough abbrev name -> full name (jabbrev) -> issn (issnlister) -> container id (fatcat)
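
The chain above could be sketched as follows, assuming three in-memory lookup tables. The function names and the two small dictionaries are hypothetical; the values are taken from the example in this section. Unlike the `grep` call above, the pattern is matched against the "name" field only, so it returns fewer candidates.

```python
import re
from typing import Dict, List, Optional, Tuple


def abbrev_to_regex(abbrev: str):
    """Turn 'Intern. J. Comput. Math.' into a loose, case-insensitive pattern."""
    tokens = [t for t in re.split(r"[.\s]+", abbrev) if t]
    return re.compile(".*".join(re.escape(t) for t in tokens), re.IGNORECASE)


def resolve(
    abbrev: str,
    jabbrev: List[Dict[str, str]],
    name_to_issn: Dict[str, str],
    issn_to_container: Dict[str, str],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (full name, issn, container id) for every candidate name."""
    pattern = abbrev_to_regex(abbrev)
    results = []
    for entry in jabbrev:
        name = entry["name"]
        if not pattern.search(name):
            continue
        issn = name_to_issn.get(name)
        container = issn_to_container.get(issn) if issn else None
        results.append((name, issn, container))
    return results


jabbrev = [
    {"name": "COMPEL-THE INTERNATIONAL JOURNAL FOR COMPUTATION AND MATHEMATICS "
             "IN ELECTRICAL AND ELECTRONIC ENGINEERING", "abbrev": "COMPEL"},
    {"name": "INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND COMPUTER SCIENCE",
     "abbrev": "INT J APPL MATH COMP"},
    {"name": "INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS",
     "abbrev": "INT J COMPUT MATH"},
]
# Hypothetical stand-ins for the issnlister and fatcat container lookups.
name_to_issn = {"INTERNATIONAL JOURNAL OF COMPUTER MATHEMATICS": "0020-7160"}
issn_to_container = {"0020-7160": "y4k3i2fvabgarkvywismzvy23a"}

for row in resolve("Intern. J. Comput. Math.", jabbrev, name_to_issn, issn_to_container):
    print(row)
```

The loose pattern still yields more than one candidate, so some ranking or further verification (for example, comparing against the "abbrev" field) would be needed before accepting a container id.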

Example, title match with OL:

```
{
  "biblio": {
    "container_name": "Private schooling in less economically developed countries",
    "contrib_raw_names": [
      "Caddell M."
    ],
    "year": 2008
  },
  "index": 7,
  "key": "CIT0008",
  "ref_source": "crossref",
  "release_year": 2011,
  "release_ident": "xqgaanhpf5gotdxxxytgyxw2ty",
  "work_ident": "aaaq35j3angwzdpzcvzdil3v4y"
}
```

A matching OL edition record:

```
{
  "publishers": [
    "Symposium Books"
  ],
  "languages": [
    {
      "key": "/languages/eng"
    }
  ],
  "number_of_pages": 214,
  "subtitle": "Asian and African Perspectives (Oxford Studies in Comparative Education)",
  "weight": "12.6 ounces",
  "title": "Private Schooling in Less Economically Developed Countries",
  "isbn_10": [
    "1873927851"
  ],
  "type": {
    "key": "/type/edition"
  },
  "identifiers": {
    "goodreads": [
      "1078335"
    ]
  },
  "isbn_13": [
    "9781873927854"
  ],
  "covers": [
    3020365
  ],
  "physical_format": "Paperback",
  "key": "/books/OL12102259M",
  "publish_date": "April 1, 2007",
  "contributions": [
    "Prachi Srivastava (Editor)",
    "Geoffrey Walford (Editor)"
  ],
  "subjects": [
    "Organization & management of education",
    "ASIA",
    "Africa",
    "Reference / General"
  ],
  "physical_dimensions": "9.1 x 6.1 x 0.7 inches",
  "works": [
    {
      "key": "/works/OL24081822W"
    }
  ],
  "lccn": [
    "2007408632"
  ],
  "lc_classifications": [
    "LC57.5 .P75 2007"
  ],
```


----

# Completeness

```
{
  "biblio": {
    "container_name": "La vida y época de Prebisch",
    "year": 2010
  },
  "index": 5,
  "key": "key20191115064515_B6",
  "ref_source": "crossref",
  "release_year": 2019,
  "release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
  "work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```

* https://fatcat.wiki/release/oc6nhkoah5gcnjfsjpct4ij3ea/references
* https://www.iberoamericana.se/articles/10.16993/iberoamericana.467/galley/445/download/

In the PDF, we also find a DOI, but it does not seem to be extracted. In fact,
the ref data comes from crossref.

Grobid gets the DOI:

```
<biblStruct xml:id="b5">
        <monogr>
                <title level="m" type="main">La vida y época de Prebisch. 1901-1986. Madrid: Marcial Pons</title>
                <author>
                        <persName xmlns="http://www.tei-c.org/ns/1.0"><forename type="first">E</forename><forename type="middle">J</forename><surname>Dosman</surname></persName>
                </author>
                <idno type="DOI">10.18356/40a5d411-es</idno>
                <ptr target="https://doi.org/10.18356/40a5d411-es" />
                <imprint>
                        <date type="published" when="2010" />
                </imprint>
        </monogr>
</biblStruct>
```
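
Pulling the DOI out of such a GROBID `biblStruct` could look like the sketch below, using only the standard library. `extract_dois` is a hypothetical helper; the fragment is a shortened version of the output above. Tags are compared by local name, because the fragment mixes namespaced and plain elements.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_dois(tei_fragment: str) -> List[str]:
    """Return the text of all <idno type="DOI"> elements in the fragment."""
    root = ET.fromstring(tei_fragment)
    dois = []
    for el in root.iter():
        if local_name(el.tag) == "idno" and el.get("type") == "DOI" and el.text:
            dois.append(el.text.strip())
    return dois


fragment = """<biblStruct xml:id="b5">
  <monogr>
    <title level="m" type="main">La vida y época de Prebisch. 1901-1986. Madrid: Marcial Pons</title>
    <idno type="DOI">10.18356/40a5d411-es</idno>
    <ptr target="https://doi.org/10.18356/40a5d411-es" />
    <imprint>
      <date type="published" when="2010" />
    </imprint>
  </monogr>
</biblStruct>"""

print(extract_dois(fragment))
```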

Other issues:

* year vs release_year

```
{
  "biblio": {
    "container_name": "The Methodology of Scientific Research Programmes",
    "year": 1980
  },
  "index": 13,
  "key": "key20191115064515_B14",
  "ref_source": "crossref",
  "release_year": 2019,
  "release_ident": "oc6nhkoah5gcnjfsjpct4ij3ea",
  "work_ident": "aaachbf2kbdnxekwdujbmnlw4a"
}
```

## Conservative Verification

* close, but different years; it seems these would actually be matches

```
different       year    /works/OL13199655W      lvtfhk63kjbthacu2aam3jgudu      1000000delinquents      1000000 delinquents     1,000,000 Delinquents
different       year    /works/OL13199655W      wpp46slm6nca7b3nwdwtjbegla      1000000delinquents      1000000 delinquents     1,000,000 Delinquents
different       year    /works/OL13199655W      gvzsp7pz75d6roxrrarduprmie      1000000delinquents      1000000 delinquents     1,000,000 Delinquents
different       year    /works/OL13199655W      ilga2kj4nnaqdh4rmogbsgbgbe      1000000delinquents      1000000 delinquents     1,000,000 Delinquents
different       year    /works/OL13199655W      5ujpef3vjzhkvmse6ovey2q2zi      1000000delinquents      1000000 delinquents     1,000,000 Delinquents
```

## Journal name augmentation

In ~160M unmatched refs (release format), we could resolve about 14M container names (roughly 9%) via `skate-resolve-journal-name`.

```
$ zstdcat date-2021-05-06.tsv.zst | skate-resolve-journal-name -B -A /magna/data/jabbrev.json | cut -f 2 | pv -l | LC_ALL=C grep -cF resolved_container_name
2021/06/01 13:02:20 found 27178 abbreviation mappings
 160M 0:14:49 [ 180k/s] [ <=> ]
14090677
```

## Discrepancy

* https://fatcat.wiki/release/cgmnjwrhlvccxnxyewd4buuhzm/references

UnmatchedRefs contains entry 11:

```
{
  "biblio": {
    "container_name": "Med J Aust",
    "contrib_raw_names": [
      "Dracup K"
    ],
    "volume": "166",
    "year": 1997
  },
  "index": 9,
  "key": "bibr11-010740830802800102",
  "ref_source": "crossref",
  "release_year": 2008,
  "release_ident": "cgmnjwrhlvccxnxyewd4buuhzm",
  "work_ident": "aabzzlohgza2pfaol7cgqlvpke"
}
```

In the frontend, we only have a DOI: https://fatcat.wiki/release/lookup?doi=10.1016/s0147-9563(97)90082-0

## More OL matching

> ran Open Library and fatcat fuzzy matching (via container name) on all docs
> that did not have an id-based match; found 139M link candidates, of which
> verification yielded 11M strong or exact matches, of which around 3M had some IA
> identifier (about 200K unique; but looking at a few of them, it seems these were somewhat
> restricted items, e.g. "print-disabled")

Most referenced items were:

```
  13010 ia:discoverygrounde00glas
   9351 ia:selfefficacyexer0000band
   8341 ia:basicsofqualitat0000stra
   7562 ia:researchdesignqu00cres
   7027 ia:basicsqualitativ00stra
   6397 ia:qualitativedataa00mile
   5958 ia:briefhistoryneol00harv
   5779 ia:constructinggrou00char
   5291 ia:reassemblingsoci00lato
   4908 ia:econometricanaly0000gree_f5q0
   4762 ia:powerknowledges00fouc
   4733 ia:stressappraisalc00rich
   4673 ia:locationculture00bhab_220
   4405 ia:threeworldswelfa00espi
   4330 ia:bodiesthatmatter00butl_662
   4306 ia:gendertroublefem0000butl_d7d5
   4249 ia:contentanalysisi00krip
   4059 ia:practiceofeveryd01cert
   4013 ia:intermolecularsu00isra
   3925 ia:culturesorganiza00hofs
   3621 ia:fractalgeometryo00beno
   3564 ia:modernityatlarge00appa
   3483 ia:modelselectionmu00burn_141
   3454 ia:economicinstitut00will
   3300 ia:economictheoryof00down_0
   3289 ia:seeinglikestateh00scot_250
   3204 ia:densityfunctiona00parr
   3158 ia:experienceeducat00dewe_0
   3127 ia:infraredramanspe00naka
   3114 ia:qualitativeinqui00cres_711
   3088 ia:sciencehumanbeha00bfsk
   3050 ia:structuralequati0000byrn_g1v4
   3050 ia:numericalrecipes0000unse_j9c5
   3003 ia:conductionheatso00cars
   2955 ia:fifthdisciplineasen00seng
   2793 ia:homosacersoverei00agam_937
   2746 ia:naturalisticinqu00linc
   2737 ia:postmoderncondit00lyot_037
   2673 ia:mathematicsdiffu00cran
   2576 ia:principlespracti0000klin
   2530 ia:hydrodynamichydr00chan
   2499 ia:strategicmanagem00free
   2493 ia:viscoelasticprop00ferr
   2454 ia:wehaveneverbeenm00lato_404
   2428 ia:foucaultreader00fouc
   2386 ia:languagesymbolic0000bour_1991
   2381 ia:weaponsofweakeve0000scot
   2380 ia:greattransforma000pola
   2370 ia:crossingqualityc00amer_984
```

## Glitch in GS?

"The Fractal Geometry of Nature" is "cited by 47428"
(https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22fractal+geometry+of+nature%22&btnG=).
On page one of the citing documents, there is "Gaussian processes in machine learning"
(2003), via:
https://www.researchgate.net/profile/Olivier_Bousquet/publication/238718428_Advanced_Lectures_on_Machine_Learning_ML_Summer_Schools_2003_Canberra_Australia_February_2-14_2003_Tubingen_Germany_August_4-16_2003_Revised_Lectures/links/02e7e52c5870850311000000/Advanced-Lectures-on-Machine-Learning-ML-Summer-Schools-2003-Canberra-Australia-February-2-14-2003-Tuebingen-Germany-August-4-16-2003-Revised-Lectures.pdf#page=70
However, the paper itself does not contain such a reference anywhere in the document.


## OL fuzzy different

Reasons why pairs were marked as *different*:

```
$ zstdcat -T UnmatchedOpenLibraryMatchTable/date-2021-05-06.tsv.zst  | grep ^different | cut -f2 | LC_ALL=C sort -S50% | uniq -c | sort -nr
47324670 year
46016349 contribintersectionempty
 582618 pagecount
    460 titlefilename
     25 numdiff
```
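
The same tally can be sketched in Python, assuming the match table is tab-separated with the status in the first column and the reason in the second. `count_reasons` is a hypothetical helper; unlike `grep ^different`, it requires the first column to equal the status exactly. The last sample row is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_reasons(lines: Iterable[str], status: str = "different") -> Counter:
    """Count values of the second column for rows with the given status."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0] == status:
            counts[fields[1]] += 1
    return counts


sample = [
    "different\tyear\t/works/OL13199655W\tlvtfhk63kjbthacu2aam3jgudu\n",
    "different\tyear\t/works/OL13199655W\twpp46slm6nca7b3nwdwtjbegla\n",
    # Hypothetical row, only to show a second reason.
    "different\tpagecount\t/works/OL0000000W\taaaaaaaaaaaaaaaaaaaaaaaaaa\n",
]

for reason, n in count_reasons(sample).most_common():
    print(f"{n:>8} {reason}")
```

For the full table, the lines could be streamed from the decompressed file instead of a list, since the function only iterates once.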

The `year` may refer to different editions:

* https://fatcat.wiki/release/kngofkvoo5cinj4wqerrey4tpi/references
* https://openlibrary.org/works/OL16286792W/One_hundred_and_seventeen_days?edition=onehundredsevent00firs

> 117 Days: An Account of Confinement and Interrogation under the South African
> 90-Day Detention Law.2006 | vs This edition was published in 1965 by Penguin
> Books

## Data mismatch

* FE: https://fatcat.wiki/release/niivpohpabhajdsf35x7hr4efm/references, [8]: 2011

refs (2017 only)

```
{
  "container_name": "19 & 20: Notes for a New Social Protagonism",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "contribs": [
    {
      "raw_name": "Colective Situaciones"
    }
  ],
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 7,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0007"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "A Dictionary of Marxist Thought (2nd ed.)",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "title": "Price of production and the transformation problem",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 12,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0012"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "A Grammar of the Multitude: For an Analysis of Contemporary Forms of Life",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 45,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0044"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
{
  "container_name": "An Introduction to the Three Volumes of Karl Marx's Capital",
  "container": {
    "container_type": "",
    "ident": "",
    "issnl": "",
    "name": "",
    "publisher": "",
    "revision": "",
    "state": "",
    "wikidata_qid": ""
  },
  "ext_ids": {},
  "ident": "niivpohpabhajdsf35x7hr4efm",
  "release_year": "2017",
  "work_id": "7eghl5lcivfmha6d4uavrrkpce",
  "extra": {
    "crossref": {},
    "datacite": {},
    "skate": {
      "status": "ref",
      "ref": {
        "index": 21,
        "key": "\nkey\n\t\t\t\t20171225032503_CIT0020"
      },
      "rg": {},
      "resolved_container_name": ""
    },
    "ol": {}
  }
}
```