Version 2 (2021-02-18)
As target document, per proposals/2021-01-29_citation_api.md, we want the following:
BiblioRef ("bibliographic reference")

    _key: Optional[str]
        elasticsearch doc key, e.g.
        ("release", source_release_ident, ref_index)
        ("wikipedia", source_wikipedia_article, ref_index)
    update_ts: Optional[datetime]
        elasticsearch doc timestamp

    # metadata about source of reference
    source_release_ident: Optional[str]
    source_work_ident: Optional[str]
    source_wikipedia_article: Optional[str]
        with lang prefix like "en:Superglue"
    # skipped: source_openlibrary_work
    # skipped: source_url_surt
    source_release_stage: Optional[str]
    source_year: Optional[int]

    # context of the reference itself
    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"
    ref_locator: Optional[str]
        eg, page number

    # target of reference (identifiers)
    target_release_ident: Optional[str]
    target_work_ident: Optional[str]
    target_openlibrary_work: Optional[str]
    target_url_surt: Optional[str]
    target_url: Optional[str]
        would not be stored in elasticsearch, but would be auto-generated
        by all "get" methods from the SURT, so calling code does not need
        to do the SURT transform
    # skipped: target_wikipedia_article

    match_provenance: str
        crossref, pubmed, grobid, etc
    match_status: Optional[str]
        strong, weak, etc
        TODO: "match_strength"?
    match_reason: Optional[str]
        "doi", "isbn", "fuzzy title, author", etc
        maybe "fuzzy-title-author"?

    target_unstructured: string (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
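For orientation, the same schema as a minimal Python dataclass sketch; field names follow the spec above, while the dataclass form, the defaults, and the dict type for target_csl are illustrative rather than the actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, Optional

@dataclass
class BiblioRef:
    """One bibliographic reference edge: source document -> target document."""
    # elasticsearch doc key and timestamp
    _key: Optional[str] = None
    update_ts: Optional[datetime] = None
    # metadata about source of reference
    source_release_ident: Optional[str] = None
    source_work_ident: Optional[str] = None
    source_wikipedia_article: Optional[str] = None  # e.g. "en:Superglue"
    source_release_stage: Optional[str] = None
    source_year: Optional[int] = None
    # context of the reference itself
    ref_index: int = 1                  # 1-indexed, not 0-indexed
    ref_key: Optional[str] = None       # e.g. "Lee86", "BIB23"
    ref_locator: Optional[str] = None   # e.g. page number
    # target of reference (identifiers)
    target_release_ident: Optional[str] = None
    target_work_ident: Optional[str] = None
    target_openlibrary_work: Optional[str] = None
    target_url_surt: Optional[str] = None
    target_url: Optional[str] = None    # derived from the SURT, not stored
    # match metadata
    match_provenance: str = ""          # crossref, pubmed, grobid, etc
    match_status: Optional[str] = None  # strong, weak, etc
    match_reason: Optional[str] = None  # "doi", "isbn", "fuzzy title, author"
    # fallback fields, only if there is no release_ident link/match
    target_unstructured: Optional[str] = None
    target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON-ish blob
```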
The resulting docs/index will be generated from various pipelines:
- various identifier joins (doi, pmid, pmcid, arxiv, ...)
- a fuzzy matching pipeline
- a wikipedia "scan" over publications, by DOI, title, direct link
- an open library "scan", matching possibly ISBN or book titles against the catalog
- relating a source document to all its referenced web pages (as target_url)
The raw inputs:
- release export (expanded or minimized)
- an aggregated list of references
- wikipedia dumps, e.g. en, de, fr, es, ...
- an openlibrary dump
- auxiliary data structures, e.g. journal name lookup database (abbreviations), etc.
- MAG, BASE, AMiner, and other datasets to run comparisons against
Setup and deployment
- [-] clone this repo
- [x] copy "zipapp"
- [x] setup raw inputs in settings.ini
- [x] run task
Using shiv to create a single-file deployment; a single config file; a handle to list and inspect files. Keep it minimal; external tools live in skate.
Match with more complete data
- [x] more sensible switching between inputs (e.g. sample, full, etc.)
For joins.
- [x] reduce release entities to minimum (ReleaseEntityReduced)
Reduced 120G to 48G, big win (stripping files, refs, and container extra); 154,203,375 docs (12 min to count)
- [ ] extract not to (ident, value), but to (ident, value, doc) or the like
- [ ] the joined row should contain both metadata blobs, to generate the fuller schema (see the sketch below)
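A sketch of what the (ident, value, doc) extraction could look like for the DOI case, reading newline-delimited release entities from stdin and emitting one envelope per DOI; the envelope field names `ident`/`key`/`doc` and the lowercasing are assumptions, not the current format:

```python
import json
import sys

def extract_doi_rows(line: str):
    """Emit (ident, key, doc) envelopes instead of bare (ident, value) pairs,
    so the join step can assemble a fuller BiblioRef without a second lookup."""
    release = json.loads(line)
    doi = (release.get("ext_ids") or {}).get("doi")
    if not doi:
        return
    yield {"ident": release["ident"], "key": doi.lower(), "doc": release}

if __name__ == "__main__":
    for line in sys.stdin:
        for row in extract_doi_rows(line):
            print(json.dumps(row))
```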
Zipped Merge
We need:
- refs to releases, derive key, sort
- reduced releases, derive key, sort
- [ ] sort fatcat and refs by key
- [ ] zipped iteration over both docs (and run verify)
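The actual zipped iteration will live in Go (skate, see below), but the idea is small enough to sketch in Python; this assumes both inputs are newline-delimited JSON in the (ident, key, doc) envelope format and are already sorted by key:

```python
import itertools
import json

def read_groups(path):
    """Yield (key, [rows]) groups from a newline-delimited JSON file
    that is already sorted by its "key" field."""
    with open(path) as fh:
        rows = (json.loads(line) for line in fh)
        for key, group in itertools.groupby(rows, key=lambda row: row["key"]):
            yield key, list(group)

def zipped(releases_path, refs_path):
    """Merge-join two key-sorted streams: yield (key, releases, refs) for
    every key present on both sides. Only sorting is required up front,
    no separate "group by" pass."""
    a, b = read_groups(releases_path), read_groups(refs_path)
    ka, ga = next(a, (None, None))
    kb, gb = next(b, (None, None))
    while ka is not None and kb is not None:
        if ka < kb:
            ka, ga = next(a, (None, None))
        elif ka > kb:
            kb, gb = next(b, (None, None))
        else:
            yield ka, ga, gb
            ka, ga = next(a, (None, None))
            kb, gb = next(b, (None, None))
```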
Other datasets
- [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2
Zipped Verification
- besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups)
Advantages of zip mode:
- we only need to generate sorted datasets; we can skip the separate "group by" transform
- easier to carry the whole doc around, which is what we want, to generate a more complete result document
$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
-F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
A basic framework in Go for doing zipped iteration.
- we need the generic (id, key, doc) format, maybe just a jq tweak
Example of the size increase from carrying data into the key matching step: about 10x (3 GB to 30 GB, compressed).
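On top of the zipped iteration sketch above, the "comm"-like verification can run any per-group routine; here is a sketch with an invented check (counting refs whose key maps to exactly one candidate release), not the actual skate-verify logic:

```python
def verify(releases_path, refs_path):
    """Run a per-group check over the zipped streams (see zipped() above).
    The check itself is made up for illustration."""
    ok = ambiguous = 0
    for key, releases, refs in zipped(releases_path, refs_path):
        if len(releases) == 1:
            ok += len(refs)
        else:
            ambiguous += len(refs)
    return ok, ambiguous
```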
Putting pieces together:
- 620,626,126 DOI "join"
- 23,280,469 fuzzy
- 76,382,408 pmid
- 49,479 pmcid
- 3,011,747 arxiv
COCI/Crossref currently has:
- 759,516,507 citation links
- we have ~723,350,228
$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst|LC_ALL=C wc
717435777 717462400 281422956549
Some notes on unparsed data:
"unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' | \
    head -1000000 | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
- 4400/100000; 5% of 500M would still be 25M?
- pattern matching?
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
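A sketch of the same filter in Python, useful for experimenting with more patterns; the regex is the one from the grep above, and treating a match as "likely still parseable" is only a heuristic guess:

```python
import json
import re
import sys

# same pattern as the grep above: a short hyphenated number range, which
# often indicates a page range (e.g. "189-195") in an unparsed reference
PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")

def unparsed_with_page_range(biblio: dict) -> bool:
    """True for refs with no title/doi/pmid but an unstructured string
    containing a page-range-like pattern (heuristic only)."""
    if biblio.get("title") or biblio.get("doi") or biblio.get("pmid"):
        return False
    unstructured = biblio.get("unstructured")
    return bool(unstructured and PAGE_RANGE.search(unstructured))

if __name__ == "__main__":
    count = sum(
        1 for line in sys.stdin
        if unparsed_with_page_range(json.loads(line).get("biblio") or {})
    )
    print(count)
```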
Data lineage for "v2":
$ refcat.pyz deps BiblioRefV2
\_ BiblioRefV2(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
      \_ FatcatPMID(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
      \_ RefsPMID(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
      \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
         \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
            \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
               \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
                  \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
               \_ RefsToRelease(dataset=full, date=2021-02-20)
                  \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
      \_ RefsPMCID(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
      \_ FatcatPMCID(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
      \_ FatcatDOI(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
      \_ RefsDOI(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
   \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
      \_ RefsArxiv(dataset=full, date=2021-02-20)
         \_ Refs(dataset=full, date=2021-02-20)
      \_ FatcatArxiv(dataset=full, date=2021-02-20)
         \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
            \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
- reran the V2 derivation from scratch on aitio (w/ unstructured)
- 785,569,011 docs; 103% of OCI