Version 2 (2021-02-18)
As target document, per proposals/2021-01-29_citation_api.md, we want the following:
BiblioRef ("bibliographic reference")

    _key: Optional[str]
        elasticsearch doc key, e.g.
        ("release", source_release_ident, ref_index)
        ("wikipedia", source_wikipedia_article, ref_index)
    update_ts: Optional[datetime]
        elasticsearch doc timestamp

    # metadata about source of reference
    source_release_ident: Optional[str]
    source_work_ident: Optional[str]
    source_wikipedia_article: Optional[str]
        with lang prefix like "en:Superglue"
    # skipped: source_openlibrary_work
    # skipped: source_url_surt
    source_release_stage: Optional[str]
    source_year: Optional[int]

    # context of the reference itself
    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"
    ref_locator: Optional[str]
        eg, page number

    # target of reference (identifiers)
    target_release_ident: Optional[str]
    target_work_ident: Optional[str]
    target_openlibrary_work: Optional[str]
    target_url_surt: Optional[str]
    target_url: Optional[str]
        would not be stored in elasticsearch, but would be auto-generated
        by all "get" methods from the SURT, so calling code does not need
        to do the SURT transform
    # skipped: target_wikipedia_article

    match_provenance: str
        crossref, pubmed, grobid, etc
    match_status: Optional[str]
        strong, weak, etc
        TODO: "match_strength"?
    match_reason: Optional[str]
        "doi", "isbn", "fuzzy title, author", etc
        maybe "fuzzy-title-author"?

    target_unstructured: string (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
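For orientation, the same schema as a minimal Python dataclass sketch; field names follow the spec above, while the dataclass form, the defaults, and the dict type for target_csl are illustrative rather than the actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional

@dataclass
class BiblioRef:
    """One bibliographic reference edge: source document -> target document."""
    # elasticsearch doc key and timestamp
    _key: Optional[str] = None
    update_ts: Optional[datetime] = None
    # metadata about source of reference
    source_release_ident: Optional[str] = None
    source_work_ident: Optional[str] = None
    source_wikipedia_article: Optional[str] = None  # e.g. "en:Superglue"
    source_release_stage: Optional[str] = None
    source_year: Optional[int] = None
    # context of the reference itself
    ref_index: int = 1                  # 1-indexed, not 0-indexed
    ref_key: Optional[str] = None       # e.g. "Lee86", "BIB23"
    ref_locator: Optional[str] = None   # e.g. page number
    # target of reference (identifiers)
    target_release_ident: Optional[str] = None
    target_work_ident: Optional[str] = None
    target_openlibrary_work: Optional[str] = None
    target_url_surt: Optional[str] = None
    target_url: Optional[str] = None    # derived from the SURT, not stored
    # match metadata
    match_provenance: str = ""          # crossref, pubmed, grobid, etc
    match_status: Optional[str] = None  # strong, weak, etc
    match_reason: Optional[str] = None  # "doi", "isbn", "fuzzy title, author"
    # fallback fields, only if there is no release_ident link/match
    target_unstructured: Optional[str] = None
    target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON-ish blob
```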
The resulting docs/index will be generated from various pipelines:
- various identifier joins (doi, pmid, pmcid, arxiv, ...)
- a fuzzy matching pipeline
- a wikipedia "scan" over publications, by DOI, title, direct link
- an open library "scan", matching possibly ISBN or book titles against the catalog
- relating a source document to all its referenced web pages (as target_url)
The raw inputs:
- release export (expanded or minimized)
- an aggregated list of references
- wikipedia dumps, e.g. en, de, fr, es, ...
- an openlibrary dump
- auxiliary data structures, e.g. journal name lookup database (abbreviations), etc.
- MAG, BASE, AMiner, and other datasets to run comparisons against
Setup and deployment
- [-] clone this repo
- [x] copy "zipapp"
- [x] setup raw inputs in settings.ini
- [x] run task
Using shiv to create a single-file deployment; a single config file; a handle to list and inspect files. Keep it minimal; external tools live in skate.
Match with more complete data
- [x] more sensible switching between inputs (e.g. sample, full, etc.)
For joins.
- [x] reduce release entities to minimum (ReleaseEntityReduced)
Reduced 120G to 48G, big win (stripping files, refs, and container extra); 154,203,375 docs (12 min to count)
- [ ] extract not to (ident, value), but to (ident, value, doc) or the like
- [ ] the joined row should contain both metadata blobs, to generate the fuller schema (see the sketch below)
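A sketch of what the (ident, value, doc) extraction could look like for the DOI case, reading newline-delimited release entities from stdin and emitting one envelope per DOI; the envelope field names `ident`/`key`/`doc` and the lowercasing are assumptions, not the current format:

```python
import json
import sys

def extract_doi_rows(line: str):
    """Emit (ident, key, doc) envelopes instead of bare (ident, value) pairs,
    so the join step can assemble a fuller BiblioRef without a second lookup."""
    release = json.loads(line)
    doi = (release.get("ext_ids") or {}).get("doi")
    if not doi:
        return
    yield {"ident": release["ident"], "key": doi.lower(), "doc": release}

if __name__ == "__main__":
    for line in sys.stdin:
        for row in extract_doi_rows(line):
            print(json.dumps(row))
```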
Zipped Merge
We need:
- refs to releases, derive key, sort
- reduced releases, derive key, sort
- [ ] sort fatcat and refs by key
- [ ] zipped iteration over both docs (and run verify)
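The actual zipped iteration will live in Go (skate, see below), but the idea is small enough to sketch in Python; this assumes both inputs are newline-delimited JSON in the (ident, key, doc) envelope format and are already sorted by key:

```python
import itertools
import json

def read_groups(path):
    """Yield (key, [rows]) groups from a newline-delimited JSON file
    that is already sorted by its "key" field."""
    with open(path) as fh:
        rows = (json.loads(line) for line in fh)
        for key, group in itertools.groupby(rows, key=lambda row: row["key"]):
            yield key, list(group)

def zipped(releases_path, refs_path):
    """Merge-join two key-sorted streams: yield (key, releases, refs) for
    every key present on both sides. Only sorting is required up front,
    no separate "group by" pass."""
    a, b = read_groups(releases_path), read_groups(refs_path)
    ka, ga = next(a, (None, None))
    kb, gb = next(b, (None, None))
    while ka is not None and kb is not None:
        if ka < kb:
            ka, ga = next(a, (None, None))
        elif ka > kb:
            kb, gb = next(b, (None, None))
        else:
            yield ka, ga, gb
            ka, ga = next(a, (None, None))
            kb, gb = next(b, (None, None))
```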
Other datasets
- [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2
Zipped Verification
- besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups)
Advantages of zip mode:
- we only need to generate sorted datasets; we can skip the separate "group by" transform
- easier to carry the whole doc around, which is what we want, to generate a more complete result document
$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
-F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
A basic framework in Go for doing zipped iteration.
- we need the generic (id, key, doc) format, maybe just a jq tweak
Example of the size increase from carrying data into the key matching step: about 10x (3 GB to 30 GB, compressed).
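On top of the zipped iteration sketch above, the "comm"-like verification can run any per-group routine; here is a sketch with an invented check (counting refs whose key maps to exactly one candidate release), not the actual skate-verify logic:

```python
def verify(releases_path, refs_path):
    """Run a per-group check over the zipped streams (see zipped() above).
    The check itself is made up for illustration."""
    ok = ambiguous = 0
    for key, releases, refs in zipped(releases_path, refs_path):
        if len(releases) == 1:
            ok += len(refs)
        else:
            ambiguous += len(refs)
    return ok, ambiguous
```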
Putting pieces together:
- 620,626,126 DOI "join"
- 23,280,469 fuzzy
- 76,382,408 pmid
- 49,479 pmcid
- 3,011,747 arxiv
COCI/Crossref currently has:
- 759,516,507 citation links
- we have ~723,350,228
$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst|LC_ALL=C wc
717435777 717462400 281422956549
Some notes on unparsed data:
"unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' | \
    head -1000000 | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
- 4400/100000; 5% of 500M would still be 25M?
- pattern matching?
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
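A sketch of the same filter in Python, useful for experimenting with more patterns; the regex is the one from the grep above, and treating a match as "likely still parseable" is only a heuristic guess:

```python
import json
import re
import sys

# same pattern as the grep above: a short hyphenated number range, which
# often indicates a page range (e.g. "189-195") in an unparsed reference
PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")

def unparsed_with_page_range(biblio: dict) -> bool:
    """True for refs with no title/doi/pmid but an unstructured string
    containing a page-range-like pattern (heuristic only)."""
    if biblio.get("title") or biblio.get("doi") or biblio.get("pmid"):
        return False
    unstructured = biblio.get("unstructured")
    return bool(unstructured and PAGE_RANGE.search(unstructured))

if __name__ == "__main__":
    count = sum(
        1 for line in sys.stdin
        if unparsed_with_page_range(json.loads(line).get("biblio") or {})
    )
    print(count)
```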
Data lineage for "v2":
$ refcat.pyz deps BiblioRefV2
\_ BiblioRefV2(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
      \_ FatcatPMID(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
      \_ RefsPMID(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
      \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
         \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
            \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
               \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
                  \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
               \_ RefsToRelease(dataset=full, date=2021-02-20)
                  \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
      \_ RefsPMCID(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
      \_ FatcatPMCID(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
      \_ FatcatDOI(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
      \_ RefsDOI(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
      \_ RefsArxiv(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
      \_ FatcatArxiv(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
- reran the V2 derivation from scratch on aitio (w/ unstructured)
- 785,569,011 docs; 103% of OCI