author    Martin Czygan <martin.czygan@gmail.com>  2021-03-21 01:39:13 +0100
committer Martin Czygan <martin.czygan@gmail.com>  2021-03-21 01:39:13 +0100
commit    1eae78c37dcb605c369d977f4ad764694603641b (patch)
tree      45c6f500c8d004f341939b2c027fba30147ed7cd /python/notes
parent    6af4de12553fe1fdbb1e08342df0a84052e985cb (diff)
download  refcat-1eae78c37dcb605c369d977f4ad764694603641b.tar.gz
          refcat-1eae78c37dcb605c369d977f4ad764694603641b.zip
add ol and wikipedia notes
Diffstat (limited to 'python/notes')
-rw-r--r--  python/notes/openlibrary_works.md    | 27
-rw-r--r--  python/notes/wikipedia_references.md | 85
2 files changed, 112 insertions, 0 deletions
diff --git a/python/notes/openlibrary_works.md b/python/notes/openlibrary_works.md
new file mode 100644
index 0000000..25df527
--- /dev/null
+++ b/python/notes/openlibrary_works.md
@@ -0,0 +1,27 @@
+
+## Upstream Dumps
+
+Open Library does monthly bulk dumps: <https://archive.org/details/ol_exports?sort=-publicdate>
+
+Latest work dump: <https://openlibrary.org/data/ol_dump_works_latest.txt.gz>
+
+TSV columns:
+
+ type - type of record (/type/edition, /type/work etc.)
+ key - unique key of the record. (/books/OL1M etc.)
+ revision - revision number of the record
+ last_modified - last modified timestamp
+ JSON - the complete record in JSON format
+
+ zcat ol_dump_works_latest.txt.gz | cut -f5 | head | jq .
+
+We are going to want (a rough extraction sketch follows this list):
+
+- title (with "prefix"?)
+- authors
+- subtitle
+- year
+- identifier (work? edition?)
+- isbn-13 (if available)
+- borrowable or not
+
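+A rough extraction sketch for the fields above, reading the work dump line by
+line (assumptions: the dump keeps the five-column layout described earlier,
+and the work JSON carries `title`, `subtitle`, `authors` and sometimes
+`first_publish_date`; ISBN-13 and lending status live on editions, so they
+would have to come from the editions dump or the availability API):
+
+    import gzip
+    import json
+
+    def extract_work(line):
+        # dump columns: type, key, revision, last_modified, JSON
+        parts = line.rstrip("\n").split("\t")
+        if len(parts) != 5 or parts[0] != "/type/work":
+            return None
+        doc = json.loads(parts[4])
+        return {
+            "key": parts[1],  # e.g. /works/OL45883W
+            "title": doc.get("title"),
+            "subtitle": doc.get("subtitle"),
+            # author keys only; resolving names needs the authors dump
+            "authors": [a.get("author", {}).get("key")
+                        for a in doc.get("authors", [])],
+            "first_publish_date": doc.get("first_publish_date"),
+        }
+
+    if __name__ == "__main__":
+        with gzip.open("ol_dump_works_latest.txt.gz", "rt") as f:
+            for line in f:
+                row = extract_work(line)
+                if row is not None:
+                    print(json.dumps(row))
+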
diff --git a/python/notes/wikipedia_references.md b/python/notes/wikipedia_references.md
new file mode 100644
index 0000000..e4c9b4c
--- /dev/null
+++ b/python/notes/wikipedia_references.md
@@ -0,0 +1,85 @@
+
+## Upstream Projects
+
+There have been a few different research and infrastructure projects to extract
+references from Wikipedia articles.
+
+"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)
+https://arxiv.org/abs/2007.07022
+https://github.com/Harshdeep1996/cite-classifications-wiki
+http://doi.org/10.5281/zenodo.3940692
+
+> A total of 29.3M citations were extracted from 6.1M English Wikipedia
+> articles as of May 2020, and classified as being to books, journal articles
+> or Web contents. We were thus able to extract 4.0M citations to scholarly
+> publications with known identifiers — including DOI, PMC, PMID, and ISBN
+
+This project seems to aim at ongoing updates and integration into other
+services (like OpenCitations). The dataset release consists of parquet files.
+It includes some partial resolution of citations that lack identifiers, using
+the Crossref API.
+
+"Citations with identifiers in Wikipedia" (~2018)
+https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
+https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1
+
+This was a Wikimedia Foundation effort. Covers all language sites, which is
+great, but is out of date (not ongoing), and IIRC only includes works with a
+known PID (DOI, ISBN, etc).
+
+"Quantifying Engagement with Citations on Wikipedia" (2020)
+
+"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)
+https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608
+
+"'I Updated the <ref>': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)
+
+Very sophisticated analysis of changes/edits to individual references over
+time, e.g. by tokenizing references and looking at edit history. Probably not
+relevant for us, though the approach can show how old a reference is. Couldn't
+find an actual download location for the dataset.
+
+## "Wikipedia Citations" Dataset
+
+ lookup_data.zip
+ Crossref API objects
+ single JSON file per DOI (many JSON files)
+
+ minimal_dataset.zip
+ many parquet files (sharded), snappy-compressed
+ subset of citations_from_wikipedia.zip
+
+ citations_from_wikipedia.zip
+ many parquet files (sharded), snappy-compressed
+
+Attempting to use pip-installable parquet tooling (the `parquet` package, not
+the "official" `parquet-tools` command) to dump the shards out as CSV or JSON:
+
+    # in a virtualenv/pipenv
+    pip install python-snappy
+    pip install parquet
+
+    # dump a single shard as JSON for inspection (shard filename is an example)
+    parquet --format json citations_from_wikipedia/part-00000.parquet | head
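+
+If the pure-Python reader has trouble with these shards, `pandas` plus
+`pyarrow` is an alternative route. A sketch, assuming the shards were unpacked
+into a `citations_from_wikipedia/` directory (paths and output filenames here
+are examples, not part of the dataset documentation):
+
+    # requires: pip install pandas pyarrow (pyarrow reads snappy-compressed parquet)
+    import glob
+
+    import pandas as pd
+
+    # read all shards and concatenate into a single frame
+    paths = sorted(glob.glob("citations_from_wikipedia/*.parquet"))
+    df = pd.concat((pd.read_parquet(p, engine="pyarrow") for p in paths),
+                   ignore_index=True)
+
+    print(df.shape)
+    print(df.columns.tolist())
+
+    # dump a small sample as JSON lines for a first look
+    df.head(10).to_json("sample.jsonl", orient="records", lines=True)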
+
+## Final Metadata Fields
+
+For the final BiblioRef object (a dataclass sketch follows this field list):
+
+ _key: ("wikipedia", source_wikipedia_article, ref_index)
+ source_wikipedia_article: Optional[str]
+ with lang prefix like "en:Superglue"
+ source_year: Optional[int]
+ current year? or article created?
+
+ ref_index: int
+ 1-indexed, not 0-indexed
+ ref_key: Optional[str]
+ eg, "Lee86", "BIB23"
+
+ match_provenance: wikipedia
+
+ get_unstructured: string (only if no release_ident link/match)
+ target_csl: free-form JSON (only if no release_ident link/match)
+ CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+ generated from unstructured by a GROBID parse, if needed
+
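+A minimal dataclass sketch mirroring the field names above (an illustration of
+the shape only, not the canonical refcat schema):
+
+    from dataclasses import dataclass
+    from typing import Any, Dict, Optional, Tuple
+
+    @dataclass
+    class BiblioRef:
+        # with lang prefix, e.g. "en:Superglue"
+        source_wikipedia_article: Optional[str] = None
+        source_year: Optional[int] = None
+        ref_index: int = 1              # 1-indexed, not 0-indexed
+        ref_key: Optional[str] = None   # e.g. "Lee86", "BIB23"
+        match_provenance: str = "wikipedia"
+        # only set when there is no release_ident link/match:
+        get_unstructured: Optional[str] = None
+        target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON-ish blob
+
+        @property
+        def _key(self) -> Tuple[str, Optional[str], int]:
+            return ("wikipedia", self.source_wikipedia_article, self.ref_index)
+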