Diffstat (limited to 'python/notes/version_2.md')

-rw-r--r--  python/notes/version_2.md  219
1 file changed, 219 insertions, 0 deletions
diff --git a/python/notes/version_2.md b/python/notes/version_2.md
new file mode 100644
index 0000000..873b5bf
--- /dev/null
+++ b/python/notes/version_2.md
@@ -0,0 +1,219 @@

# Version 2 (2021-02-18)

As the target document we want, per `proposals/2021-01-29_citation_api.md`, the following:

```
BiblioRef ("bibliographic reference")

    _key: Optional[str]
        elasticsearch doc key
        ("release", source_release_ident, ref_index)
        ("wikipedia", source_wikipedia_article, ref_index)
    update_ts: Optional[datetime]
        elasticsearch doc timestamp

    # metadata about source of reference
    source_release_ident: Optional[str]
    source_work_ident: Optional[str]
    source_wikipedia_article: Optional[str]
        with lang prefix like "en:Superglue"
    # skipped: source_openlibrary_work
    # skipped: source_url_surt
    source_release_stage: Optional[str]
    source_year: Optional[int]

    # context of the reference itself
    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"
    ref_locator: Optional[str]
        eg, page number

    # target of reference (identifiers)
    target_release_ident: Optional[str]
    target_work_ident: Optional[str]
    target_openlibrary_work: Optional[str]
    target_url_surt: Optional[str]
    target_url: Optional[str]
        would not be stored in elasticsearch, but would be auto-generated
        by all "get" methods from the SURT, so calling code does not need
        to do the SURT transform
    # skipped: target_wikipedia_article

    match_provenance: str
        crossref, pubmed, grobid, etc
    match_status: Optional[str]
        strong, weak, etc
        TODO: "match_strength"?
    match_reason: Optional[str]
        "doi", "isbn", "fuzzy title, author", etc
        maybe "fuzzy-title-author"?

    target_unstructured: string (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to the ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
```

The resulting docs/index will be generated from various pipelines:

* various identifier joins (doi, pmid, pmcid, arxiv, ...)
* a fuzzy matching pipeline
* a wikipedia "scan" over publications, by DOI, title, or direct link
* an Open Library "scan", matching ISBNs or book titles against the catalog, where possible
* relating a source document to all its referenced web pages (as `target_url`)

The raw inputs:

* release export (expanded or minimized)
* an aggregated list of references
* wikipedia dumps, e.g. en, de, fr, es, ...
* an Open Library dump
* auxiliary data structures, e.g. a journal name lookup database (abbreviations), etc.
* MAG, BASE, AMiner, and other datasets to run comparisons against

# Setup and deployment

* [-] clone this repo
* [x] copy "zipapp"
* [x] set up raw inputs in settings.ini
* [x] run task

Using shiv to create a single-file deployment. Single config file. A handle to
list and inspect files. Keep it minimal. External tools live in skate.

----

# Match with more complete data

* [x] more sensible switching between inputs (e.g. sample, full, etc.)

For joins, both sides need a derived key and a sort.
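For example, a minimal sketch of what deriving such a sortable key row could look like on the release side. Field names follow the fatcat release export (`ident`, `ext_ids.doi`); the script and filenames are hypothetical, not the actual refcat/skate extraction.

```python
# Hypothetical sketch: derive a sortable (key, ident, doc) row per release,
# so releases and refs can be merge-joined after an external sort.
import fileinput
import json
from typing import Optional

def release_to_doi_row(line: str) -> Optional[str]:
    doc = json.loads(line)
    doi = (doc.get("ext_ids") or {}).get("doi")
    if not doi:
        return None
    # Key first, so `LC_ALL=C sort -k1,1` lines up both sides of the join.
    # json.dumps escapes tabs/newlines, so the doc is safe as a TSV column.
    return "\t".join([doi.lower(), doc.get("ident", ""), json.dumps(doc)])

if __name__ == "__main__":
    for line in fileinput.input():
        row = release_to_doi_row(line)
        if row:
            print(row)
```

Something like `zstdcat release_export_reduced.json.zst | python derive_doi_key.py | LC_ALL=C sort -k1,1` (filenames illustrative) would then produce one of the two key-sorted inputs for the zipped merge described below.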
* [x] reduce release entities to a minimum (ReleaseEntityReduced)

Reduced 120G to 48G, a big win (stripping files, refs, and container extras); 154,203,375 docs (12 min to count).

* [ ] extract not to (ident, value), but to (ident, value, doc) or the like
* [ ] the joined row should contain both metadata blobs, to generate the fuller schema

## Zipped Merge

We need:

* refs to releases: derive key, sort
* reduced releases: derive key, sort

* [ ] sort fatcat and refs by key
* [ ] zipped iteration over both docs (and run verify)

----

# Other datasets

* [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2

----

## Zipped Verification

* besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups)

Advantages of zip mode:

* we only need to generate sorted datasets; we can skip the separate "group by" transform
* easier to carry the whole doc around, which is what we want, to generate a more complete result document

```
$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
    -F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
```

A basic framework in Go for doing the zipped iteration.

* we need the generic (id, key, doc) format, maybe just a jq tweak

----

Example size increase from carrying the doc along to the key matching step: about 10x (3G to 30G compressed).

----

Putting the pieces together:

* 620,626,126 DOI "join"
* 23,280,469 fuzzy
* 76,382,408 pmid
* 49,479 pmcid
* 3,011,747 arxiv

COCI/crossref currently has:

* 759,516,507 citation links
* we: ~723,350,228

```
$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst | LC_ALL=C wc
717435777 717462400 281422956549
```

----

Some notes on unparsed data:

```
"unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"

$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' | \
    head -1000000 | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
```

* 4400/100000; 5% of 500M would still be 25M?
* pattern matching?
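A minimal sketch of what such pattern matching could look like, reusing the page-range regex from the grep above plus one extra, purely illustrative volume/page/year pattern (not a tested rule set):

```python
import re

# The grep from the note above: a bare page-range span like " 189-195".
PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")
# Illustrative extra pattern: "volume, page (year)" as in "55, 189 (1978)".
VOL_PAGE_YEAR = re.compile(r"\b\d{1,4},\s*\d{1,5}\s*\(\d{4}\)")

def looks_parseable(unstructured: str) -> bool:
    """Heuristic: does the raw string carry enough numeric structure
    (page range or volume/page/year) to be worth a structured parse?"""
    return bool(PAGE_RANGE.search(unstructured) or VOL_PAGE_YEAR.search(unstructured))

if __name__ == "__main__":
    example = "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"
    print(looks_parseable(example))  # True, via the volume/page/year pattern
```

Strings that pass such a filter could then be routed to a GROBID parse to fill `target_csl`, as the schema above suggests.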
To pull out just the unstructured strings:

```
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
```

Data lineage for "v2":

```
$ refcat.pyz deps BiblioRefV2
 \_ BiblioRefV2(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
       \_ FatcatPMID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsPMID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
       \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
          \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
             \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
                \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
                   \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
                \_ RefsToRelease(dataset=full, date=2021-02-20)
                   \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
       \_ RefsPMCID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatPMCID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
       \_ FatcatDOI(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsDOI(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
       \_ RefsArxiv(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatArxiv(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
```
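For orientation, a sketch of how one branch of this tree (the DOI join) could be expressed as luigi-style tasks. The `deps` output above suggests such a task framework; the class bodies, parameters, and target paths below are illustrative only, not the actual refcat definitions.

```python
import luigi

class ReleaseExportReduced(luigi.Task):
    dataset = luigi.Parameter(default="full")
    date = luigi.Parameter(default="2021-02-20")

    def output(self):
        # Path layout mirrors the cache paths seen above; exact layout is assumed.
        return luigi.LocalTarget(
            f".cache/refcat/ReleaseExportReduced/dataset-{self.dataset}-date-{self.date}.json.zst")

class FatcatDOI(luigi.Task):
    """Key-extract releases by DOI; depends on the reduced export."""
    dataset = luigi.Parameter(default="full")
    date = luigi.Parameter(default="2021-02-20")

    def requires(self):
        return ReleaseExportReduced(dataset=self.dataset, date=self.date)

    def output(self):
        return luigi.LocalTarget(
            f".cache/refcat/FatcatDOI/dataset-{self.dataset}-date-{self.date}.tsv.zst")

class RefsDOI(luigi.Task):
    """Key-extract raw refs by DOI; requires() would point at Refs."""
    dataset = luigi.Parameter(default="full")
    date = luigi.Parameter(default="2021-02-20")

class BiblioRefZippyDOI(luigi.Task):
    """Zipped merge of the two key-sorted DOI streams into BiblioRef docs."""
    dataset = luigi.Parameter(default="full")
    date = luigi.Parameter(default="2021-02-20")

    def requires(self):
        return {
            "fatcat": FatcatDOI(dataset=self.dataset, date=self.date),
            "refs": RefsDOI(dataset=self.dataset, date=self.date),
        }
```

In such a setup each task only re-runs when its output target is missing, which is what makes the `deps` listing above a full lineage of the final BiblioRefV2 artifact.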