# Version 2 (2021-02-18)

As the target document we want, per `proposals/2021-01-29_citation_api.md`, the following:

```
BiblioRef ("bibliographic reference")

    _key: Optional[str]
        elasticsearch doc key
        ("release", source_release_ident, ref_index)
        ("wikipedia", source_wikipedia_article, ref_index)
    update_ts: Optional[datetime]
        elasticsearch doc timestamp

    # metadata about source of reference
    source_release_ident: Optional[str]
    source_work_ident: Optional[str]
    source_wikipedia_article: Optional[str]
        with lang prefix, like "en:Superglue"
    # skipped: source_openlibrary_work
    # skipped: source_url_surt
    source_release_stage: Optional[str]
    source_year: Optional[int]

    # context of the reference itself
    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"
    ref_locator: Optional[str]
        eg, page number

    # target of reference (identifiers)
    target_release_ident: Optional[str]
    target_work_ident: Optional[str]
    target_openlibrary_work: Optional[str]
    target_url_surt: Optional[str]
    target_url: Optional[str]
        would not be stored in elasticsearch, but would be auto-generated by
        all "get" methods from the SURT, so calling code does not need to do
        the SURT transform
    # skipped: target_wikipedia_article

    match_provenance: str
        crossref, pubmed, grobid, etc
    match_status: Optional[str]
        strong, weak, etc
        TODO: "match_strength"?
    match_reason: Optional[str]
        "doi", "isbn", "fuzzy title, author", etc
        maybe "fuzzy-title-author"?

    target_unstructured: string (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
```
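For orientation, a minimal Python sketch of this target document as a plain dataclass, with a hypothetical naive SURT inversion standing in for the derived `target_url`; the real elasticsearch mapping and "get" methods live elsewhere and may differ.

```
from dataclasses import dataclass
from typing import Optional


@dataclass
class BiblioRef:
    """Sketch of the target "bibliographic reference" document (not the actual ES mapping)."""

    _key: Optional[str] = None                      # ("release", source_release_ident, ref_index)
    update_ts: Optional[str] = None

    # metadata about source of reference
    source_release_ident: Optional[str] = None
    source_work_ident: Optional[str] = None
    source_wikipedia_article: Optional[str] = None  # e.g. "en:Superglue"
    source_release_stage: Optional[str] = None
    source_year: Optional[int] = None

    # context of the reference itself
    ref_index: int = 1                              # 1-indexed, not 0-indexed
    ref_key: Optional[str] = None                   # e.g. "Lee86", "BIB23"
    ref_locator: Optional[str] = None               # e.g. page number

    # target of reference (identifiers)
    target_release_ident: Optional[str] = None
    target_work_ident: Optional[str] = None
    target_openlibrary_work: Optional[str] = None
    target_url_surt: Optional[str] = None

    # match metadata
    match_provenance: str = ""                      # crossref, pubmed, grobid, ...
    match_status: Optional[str] = None              # strong, weak, ...
    match_reason: Optional[str] = None              # "doi", "isbn", "fuzzy title, author", ...

    # fallbacks when no release_ident link/match exists
    target_unstructured: Optional[str] = None
    target_csl: Optional[dict] = None

    @property
    def target_url(self) -> Optional[str]:
        """Derive a URL from the stored SURT so callers never do the transform themselves.

        Naive inversion for illustration, e.g. "org,archive)/details/x" ->
        "http://archive.org/details/x"; scheme, port, and edge cases would need
        a real SURT library.
        """
        if not self.target_url_surt:
            return None
        host, _, path = self.target_url_surt.partition(")")
        return "http://" + ".".join(reversed(host.split(","))) + path
```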
This resulting docs/index will be generated from various pipelines:

* various identifier joins (doi, pmid, pmcid, arxiv, ...)
* a fuzzy matching pipeline
* a wikipedia "scan" over publications, by DOI, title, or direct link
* an open library "scan", possibly matching ISBN or book titles against the catalog
* relating a source document to all its referenced web pages (as `target_url`)

The raw inputs:

* release export (expanded or minimized)
* an aggregated list of references
* wikipedia dumps, e.g. en, de, fr, es, ...
* an openlibrary dump
* auxiliary data structures, e.g. a journal name lookup database (abbreviations), etc.
* MAG, BASE, AMiner, and other datasets to run comparisons against

# Setup and deployment

* [-] clone this repo
* [x] copy "zipapp"
* [x] set up raw inputs in settings.ini
* [x] run task

Using shiv for creating a single-file deployment. Single config file. Handle to list and inspect files. Keep it minimal. External tools in skate.

----

# Match with more complete data

* [x] more sensible switching between inputs (e.g. sample, full, etc.), for joins
* [x] reduce release entities to a minimum (ReleaseEntityReduced); reduced 120G to 48G, a big win (stripping files, refs, and container extra); 154203375 docs (12min to count)
* [ ] extract not to (ident, value), but to (ident, value, doc) or the like
* [ ] the joined row should contain both metadata blobs, to generate a fuller schema

Zipped Merge

We need:

* refs to releases, derive key, sort
* reduced releases, derive key, sort

* [ ] sort fatcat and refs by key
* [ ] zipped iteration over both docs (and run verify)

----

# Other datasets

* [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2

----

## Zipped Verification

* besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups)

Advantages of zip mode:

* we only need to generate sorted datasets; we can skip the "group by" transform
* easier to carry the whole doc around, which is what we want, to generate a more complete result document

```
$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
    -F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
```

A basic framework in Go for doing zipped iteration (a sketch of the idea follows below).

* we need the generic (id, key, doc) format, maybe just a jq tweak
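The actual zipped iteration lives in skate (Go); what follows is only a minimal Python sketch of the idea, assuming two newline-delimited JSON inputs already sorted by a shared `key` field (field and file names are illustrative): a merge-join pairs up matching groups from both streams without a separate "group by" pass.

```
import itertools
import json
import sys


def read_keyed(stream, key_field="key"):
    """Yield (key, doc) pairs from newline-delimited JSON, already sorted by key."""
    for line in stream:
        doc = json.loads(line)
        yield doc[key_field], doc


def zip_groups(refs, releases):
    """Merge-join two key-sorted (key, doc) iterators; yield (key, ref_docs, release_docs)."""
    grouped_refs = itertools.groupby(refs, key=lambda kv: kv[0])
    grouped_releases = itertools.groupby(releases, key=lambda kv: kv[0])
    r = next(grouped_refs, None)
    f = next(grouped_releases, None)
    while r is not None and f is not None:
        rkey, rgroup = r
        fkey, fgroup = f
        if rkey < fkey:
            r = next(grouped_refs, None)
        elif rkey > fkey:
            f = next(grouped_releases, None)
        else:
            # materialize both groups before advancing, since groupby shares the
            # underlying iterator with its parent
            yield rkey, [doc for _, doc in rgroup], [doc for _, doc in fgroup]
            r = next(grouped_refs, None)
            f = next(grouped_releases, None)


if __name__ == "__main__":
    # usage: python zip_sketch.py refs_sorted.json releases_sorted.json
    with open(sys.argv[1]) as refs_file, open(sys.argv[2]) as releases_file:
        for key, ref_docs, release_docs in zip_groups(read_keyed(refs_file), read_keyed(releases_file)):
            # here we could verify the group, or emit a fuller BiblioRef document
            print(key, len(ref_docs), len(release_docs))
```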
----

Example size increase by carrying data along to the key matching step: about 10x (3 to 30G compressed).

----

Putting pieces together:

* 620,626,126 DOI "join"
* 23,280,469 fuzzy
* 76,382,408 pmid
* 49,479 pmcid
* 3,011,747 arxiv

COCI/crossref currently has:

* 759,516,507 citation links
* we: ~723,350,228

```
$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst | LC_ALL=C wc
717435777  717462400 281422956549
```

----

Some notes on unparsed data:

```
"unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"

$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' | \
    head -1000000 | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
```

* 4400/100000; 5% of 500M would still be 25M?
* pattern matching? (see the sketch at the end of these notes)

```
$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | \
    jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
```

Data lineage for "v2":

```
$ refcat.pyz deps BiblioRefV2
 \_ BiblioRefV2(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
       \_ FatcatPMID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsPMID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
       \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
          \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
             \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
                \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
                   \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
                \_ RefsToRelease(dataset=full, date=2021-02-20)
                   \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
       \_ RefsPMCID(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatPMCID(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
       \_ FatcatDOI(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
       \_ RefsDOI(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
    \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
       \_ RefsArxiv(dataset=full, date=2021-02-20)
          \_ Refs(dataset=full, date=2021-02-20)
       \_ FatcatArxiv(dataset=full, date=2021-02-20)
          \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
             \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
```

----

* reran V2 derivation from scratch on aitio (w/ unstructured)
* 785569011 docs; 103% of OCI
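Regarding the "pattern matching?" question on unparsed references above: a rough sketch of what a first heuristic pass could look like, assuming a page-range pattern similar to the grep above plus a year pattern; the regexes, field names, and thresholds are illustrative only, not a production parser (GROBID or a richer pattern set would be the real option).

```
import json
import re
import sys

# Rough heuristics mirroring the grep above, e.g. "... Chem. Phys. Lett. 55, 189-195 (1978)".
PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")
YEAR = re.compile(r"\((19|20)[0-9]{2}\)")


def looks_parseable(unstructured: str) -> bool:
    """Flag unstructured reference strings that likely contain page-range and year information."""
    return bool(PAGE_RANGE.search(unstructured)) and bool(YEAR.search(unstructured))


if __name__ == "__main__":
    # Reads refs as newline-delimited JSON (as in the zstdcat/jq pipelines above) from stdin
    # and counts how many unstructured-only entries look amenable to pattern matching.
    total, candidates = 0, 0
    for line in sys.stdin:
        biblio = json.loads(line).get("biblio", {})
        if biblio.get("title") or biblio.get("doi") or biblio.get("pmid"):
            continue
        unstructured = biblio.get("unstructured")
        if not unstructured:
            continue
        total += 1
        if looks_parseable(unstructured):
            candidates += 1
    print(f"{candidates}/{total} unstructured-only refs match the page-range heuristic")
```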