diff --git a/python/notes/version_2.md b/python/notes/version_2.md
new file mode 100644
index 0000000..873b5bf
--- /dev/null
+++ b/python/notes/version_2.md
@@ -0,0 +1,219 @@
+# Version 2 (2021-02-18)
+
+As the target document, per `proposals/2021-01-29_citation_api.md`, we want the following:
+
+```
+BiblioRef ("bibliographic reference")
+ _key: Optional[str] elasticsearch doc key
+ ("release", source_release_ident, ref_index)
+ ("wikipedia", source_wikipedia_article, ref_index)
+ update_ts: Optional[datetime] elasticsearch doc timestamp
+
+ # metadata about source of reference
+ source_release_ident: Optional[str]
+ source_work_ident: Optional[str]
+ source_wikipedia_article: Optional[str]
+ with lang prefix like "en:Superglue"
+ # skipped: source_openlibrary_work
+ # skipped: source_url_surt
+ source_release_stage: Optional[str]
+ source_year: Optional[int]
+
+ # context of the reference itself
+ ref_index: int
+ 1-indexed, not 0-indexed
+ ref_key: Optional[str]
+ eg, "Lee86", "BIB23"
+ ref_locator: Optional[str]
+ eg, page number
+
+ # target of reference (identifiers)
+ target_release_ident: Optional[str]
+ target_work_ident: Optional[str]
+ target_openlibrary_work: Optional[str]
+ target_url_surt: Optional[str]
+ target_url: Optional[str]
+ would not be stored in elasticsearch, but would be auto-generated
+ by all "get" methods from the SURT, so calling code does not need
+ to do SURT transform
+ # skipped: target_wikipedia_article
+
+ match_provenance: str
+ crossref, pubmed, grobid, etc
+ match_status: Optional[str]
+ strong, weak, etc
+ TODO: "match_strength"?
+ match_reason: Optional[str]
+ "doi", "isbn", "fuzzy title, author", etc
+ maybe "fuzzy-title-author"?
+
+ target_unstructured: string (only if no release_ident link/match)
+ target_csl: free-form JSON (only if no release_ident link/match)
+ CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+ generated from unstructured by a GROBID parse, if needed
+```
+
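+A rough Python sketch of the same document structure (field names mirror the proposal above; types, defaults, and the derived `target_url` handling are assumptions, not the actual index mapping):
+
+```
+# Sketch of the BiblioRef document; field names follow the proposal above,
+# defaults and exact types are assumptions.
+from dataclasses import dataclass
+from typing import Optional
+
+
+@dataclass
+class BiblioRef:
+    # metadata about source of reference
+    source_release_ident: Optional[str] = None
+    source_work_ident: Optional[str] = None
+    source_wikipedia_article: Optional[str] = None  # with lang prefix, e.g. "en:Superglue"
+    source_release_stage: Optional[str] = None
+    source_year: Optional[int] = None
+
+    # context of the reference itself
+    ref_index: int = 0  # 1-indexed per the proposal; 0 is just a placeholder default
+    ref_key: Optional[str] = None  # e.g. "Lee86", "BIB23"
+    ref_locator: Optional[str] = None  # e.g. page number
+
+    # target of reference (identifiers)
+    target_release_ident: Optional[str] = None
+    target_work_ident: Optional[str] = None
+    target_openlibrary_work: Optional[str] = None
+    target_url_surt: Optional[str] = None
+    target_url: Optional[str] = None  # derived from the SURT by "get" methods, not stored
+
+    # match provenance
+    match_provenance: str = ""  # crossref, pubmed, grobid, ...
+    match_status: Optional[str] = None  # strong, weak, ...
+    match_reason: Optional[str] = None  # "doi", "isbn", "fuzzy title, author", ...
+
+    # only set when there is no release_ident link/match
+    target_unstructured: Optional[str] = None
+    target_csl: Optional[dict] = None  # CSL-JSON blob
+
+    # elasticsearch doc key and timestamp
+    _key: Optional[str] = None
+    update_ts: Optional[str] = None
+```
+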
+The resulting docs/index will be generated by various pipelines:
+
+* various identifier joins (doi, pmid, pmcid, arxiv, ...)
+* a fuzzy matching pipeline
+* a Wikipedia "scan" over publications, by DOI, title, or direct link
+* an Open Library "scan", matching ISBNs or book titles against the catalog
+* relating a source document to all its referenced web pages (as `target_url`)
+
+The raw inputs:
+
+* release export (expanded or minimized)
+* an aggregated list of references
+* wikipedia dumps, e.g. en, de, fr, es, ...
+* an openlibrary dump
+* auxiliary data structures, e.g. a journal-name lookup database (abbreviations), etc.
+* MAG, BASE, AMiner, and other datasets to run comparisons against
+
+# Setup and deployment
+
+* [-] clone this repo
+* [x] copy "zipapp"
+* [x] setup raw inputs in settings.ini
+* [x] run task
+
+Using shiv to create a single-file deployment. Single config file. A way to
+list and inspect files. Keep it minimal. External tools live in skate.
+
+----
+
+# Match with more complete data
+
+* [x] more sensible switching between inputs (e.g. sample, full, etc.)
+
+For joins.
+
+* [x] reduce release entities to minimum (ReleaseEntityReduced)
+
+Reduced 120G to 48G, a big win (stripping files, refs, and container extras); 154,203,375 docs (12 min to count)
+
+* [ ] extract not to (ident, value), but to (ident, value, doc) or the like
+* [ ] the joined row should contain both metadata blobs to generate a fuller schema
+
+## Zipped Merge
+
+We need:
+
+* refs to releases, derive key, sort
+* reduced releases, derive key, sort
+
+* [ ] sort fatcat and refs by key
+* [ ] zipped iteration over both docs (and run verify); see the sketch below
+
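+A minimal Python sketch of the zipped iteration, assuming both inputs are JSON lines already sorted by the same derived key (`key_of` is a placeholder; the real tooling is skate):
+
+```
+# Zipped merge over two key-sorted JSON-lines streams (sketch).
+import itertools
+import json
+
+
+def key_of(doc):
+    # placeholder: whatever key was derived before sorting
+    return doc["key"]
+
+
+def read_docs(path):
+    with open(path) as f:
+        for line in f:
+            yield json.loads(line)
+
+
+def zipped_groups(refs_path, releases_path):
+    refs = itertools.groupby(read_docs(refs_path), key=key_of)
+    releases = itertools.groupby(read_docs(releases_path), key=key_of)
+    r_key, r_group = next(refs, (None, None))
+    f_key, f_group = next(releases, (None, None))
+    while r_key is not None and f_key is not None:
+        if r_key < f_key:
+            r_key, r_group = next(refs, (None, None))
+        elif r_key > f_key:
+            f_key, f_group = next(releases, (None, None))
+        else:
+            # matching group: both full docs are at hand, so a fuller
+            # result schema (or any verify routine) can be produced here
+            yield r_key, list(r_group), list(f_group)
+            r_key, r_group = next(refs, (None, None))
+            f_key, f_group = next(releases, (None, None))
+```
+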
+----
+
+# Other datasets
+
+* [ ] https://archive.org/details/enwiki-20210120, example: https://archive.org/download/enwiki-20210120/enwiki-20210120-pages-articles-multistream11.xml-p6899367p7054859.bz2
+
+----
+
+## Zipped Verification
+
+* besides a one-blob-per-line model, we can run a "comm"-like procedure to verify groups (or run any other routine on groups)
+
+Advantages of zip mode:
+
+* we only need a sorted dataset; the "group by" transform can be skipped
+* easier to carry the whole doc around, which is what we want, to generate a
+ more complete result document
+
+```
+$ skate-verify -m zip -R <(zstdcat -T0 /bigger/.cache/refcat/FatcatSortedKeys/dataset-full-date-2021-02-20.json.zst) \
+ -F <(zstdcat -T0 /bigger/.cache/refcat/RefsSortedKeys/dataset-full-date-2021-02-20.json.zst)
+```
+
+A basic framework in Go for doing zipped iteration.
+
+* we need the generic (id, key, doc) format, maybe just a jq tweak (see the sketch below)
+
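+A possible shape for that format, as a Python sketch (`norm_key` and the `ident` field name are placeholders; this may well end up as a jq one-liner instead):
+
+```
+# Emit (ident, key, doc) as three tab-separated columns per input doc (sketch).
+import json
+import sys
+
+
+def norm_key(title):
+    # placeholder normalization; the real key derivation may differ
+    return "".join(c for c in (title or "").lower() if c.isalnum())
+
+
+for line in sys.stdin:
+    doc = json.loads(line)
+    # carry the full doc in the third column, so the merge step can
+    # produce a more complete result document
+    print(doc.get("ident", ""), norm_key(doc.get("title")), json.dumps(doc), sep="\t")
+```
+
+Sorted by the second column (e.g. `sort -t $'\t' -k2,2`), such files can feed the zipped iteration directly.
+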
+----
+
+Example of the size increase from carrying the full docs to the key-matching step: about 10x (3G to 30G compressed).
+
+----
+
+Putting the pieces together:
+
+* 620,626,126 DOI "join"
+* 23,280,469 fuzzy
+* 76,382,408 pmid
+* 49,479 pmcid
+* 3,011,747 arxiv
+
+COCI/crossref currently has:
+
+* 759,516,507 citation links
+* we currently have ~723,350,228
+
+```
+$ zstdcat -T0 /bigger/.cache/refcat/BiblioRefV1/dataset-full-date-2021-02-20.json.zst|LC_ALL=C wc
+717435777 717462400 281422956549
+```
+
+----
+
+Some notes on unparsed data:
+
+```
+ "unstructured": "S. F. Fischer and A. Laubereau, Chem. Phys. Lett. 55, 189 (1978).CHPLBC0009-2614"
+
+$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst \
+    | jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured' \
+    | head -1000000 \
+    | grep -c -E ' [0-9]{1,3}-[0-9]{1,3}'
+```
+
+* 4,400/100,000 (~4.4%); 5% of 500M would still be 25M?
+
+
+
+* pattern matching?
+
+```
+$ zstdcat -T0 /bigger/scholar/fatcat_scholar_work_fulltext.refs.json.zst | jq -rc 'select(.biblio.title == null and .biblio.doi == null and .biblio.pmid == null and .biblio.unstructured != null) | .biblio.unstructured'
+```
+
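+The same filter as the jq pipelines above, as a Python sketch (field names follow the refs dump; the page-range regex is only a heuristic):
+
+```
+# Count refs without title/DOI/PMID whose unstructured string contains
+# something that looks like a page range.
+import json
+import re
+import sys
+
+PAGE_RANGE = re.compile(r" [0-9]{1,3}-[0-9]{1,3}")
+
+total = matched = 0
+for line in sys.stdin:
+    biblio = json.loads(line).get("biblio", {})
+    if biblio.get("title") or biblio.get("doi") or biblio.get("pmid"):
+        continue
+    unstructured = biblio.get("unstructured")
+    if not unstructured:
+        continue
+    total += 1
+    if PAGE_RANGE.search(unstructured):
+        matched += 1
+
+print(f"{matched}/{total} unstructured-only refs with a page-range pattern")
+```
+
+Piped from the zstdcat command above, this reproduces the same kind of count.
+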
+Data lineage for "v2":
+
+```
+$ refcat.pyz deps BiblioRefV2
+ \_ BiblioRefV2(dataset=full, date=2021-02-20)
+ \_ BiblioRefZippyPMID(dataset=full, date=2021-02-20)
+ \_ FatcatPMID(dataset=full, date=2021-02-20)
+ \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
+ \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
+ \_ RefsPMID(dataset=full, date=2021-02-20)
+ \_ Refs(dataset=full, date=2021-02-20)
+ \_ BiblioRefFromFuzzyClusters(dataset=full, date=2021-02-20)
+ \_ RefsFatcatClusters(dataset=full, date=2021-02-20)
+ \_ RefsFatcatSortedKeys(dataset=full, date=2021-02-20)
+ \_ RefsReleasesMerged(dataset=full, date=2021-02-20)
+ \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
+ \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
+ \_ RefsToRelease(dataset=full, date=2021-02-20)
+ \_ Refs(dataset=full, date=2021-02-20)
+ \_ BiblioRefZippyPMCID(dataset=full, date=2021-02-20)
+ \_ RefsPMCID(dataset=full, date=2021-02-20)
+ \_ Refs(dataset=full, date=2021-02-20)
+ \_ FatcatPMCID(dataset=full, date=2021-02-20)
+ \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
+ \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
+ \_ BiblioRefZippyDOI(dataset=full, date=2021-02-20)
+ \_ FatcatDOI(dataset=full, date=2021-02-20)
+ \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
+ \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
+ \_ RefsDOI(dataset=full, date=2021-02-20)
+ \_ Refs(dataset=full, date=2021-02-20)
+ \_ BiblioRefZippyArxiv(dataset=full, date=2021-02-20)
+ \_ RefsArxiv(dataset=full, date=2021-02-20)
+ \_ Refs(dataset=full, date=2021-02-20)
+ \_ FatcatArxiv(dataset=full, date=2021-02-20)
+ \_ ReleaseExportReduced(dataset=full, date=2021-02-20)
+ \_ ReleaseExportExpanded(dataset=full, date=2021-02-20)
+```