diff --git a/python/notes/wikipedia_references.md b/python/notes/wikipedia_references.md
new file mode 100644
index 0000000..e4c9b4c
--- /dev/null
+++ b/python/notes/wikipedia_references.md
@@ -0,0 +1,85 @@
+
+## Upstream Projects
+
+There have been several research and infrastructure projects that extract
+references from Wikipedia articles.
+
+"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)
+https://arxiv.org/abs/2007.07022
+https://github.com/Harshdeep1996/cite-classifications-wiki
+http://doi.org/10.5281/zenodo.3940692
+
+> A total of 29.3M citations were extracted from 6.1M English Wikipedia
+> articles as of May 2020, and classified as being to books, journal articles
+> or Web contents. We were thus able to extract 4.0M citations to scholarly
+> publications with known identifiers — including DOI, PMC, PMID, and ISBN
+
+This project seems to aim for ongoing updates and for integration into other
+services (like OpenCitations). The dataset is released as parquet files, and
+includes partial resolution of citations which lack identifiers, using the
+Crossref API.
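+
+As a hypothetical sketch of that kind of lookup (the query string, and the
+use of the `requests` library, are my assumptions, not details from the
+paper):
+
+    import requests
+
+    # free-text bibliographic query against the public Crossref API;
+    # the citation string here is a made-up example
+    resp = requests.get(
+        "https://api.crossref.org/works",
+        params={"query.bibliographic": "Smith, Citation extraction survey, 2019", "rows": 1},
+    )
+    items = resp.json()["message"]["items"]
+    if items:
+        print(items[0]["DOI"], items[0].get("score"))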
+
+"Citations with identifiers in Wikipedia" (~2018)
+https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
+https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1
+
+This was a Wikimedia Foundation effort. It covers all language editions,
+which is great, but it is out of date (not ongoing), and IIRC it only
+includes works with a known PID (DOI, ISBN, etc.).
+
+"Quantifying Engagement with Citations on Wikipedia" (2020)
+
+"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)
+https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608
+
+"'I Updated the <ref>': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)
+
+Very sophisticated analysis of changes/edits to individual references over
+time, e.g., by tokenizing references and following them through the edit
+history. Probably not relevant for us, though the method can show how old a
+reference is. I couldn't find an actual download location for the dataset.
+
+## "Wikipedia Citations" Dataset
+
+ lookup_data.zip
+ Crossref API objects
+ single JSON file per DOI (many JSON files)
+
+ minimal_dataset.zip
+ many parquet files (sharded), snappy-compressed
+ subset of citations_from_wikipedia.zip
+
+ citations_from_wikipedia.zip
+ many parquet files (sharded), snappy-compressed
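+
+To peek at the schema of one shard (a sketch; assumes `pyarrow` is
+installed, and the shard filename is a placeholder):
+
+    import pyarrow.parquet as pq
+
+    # pyarrow handles snappy decompression natively
+    table = pq.read_table("citations_from_wikipedia/part-00000.parquet")
+    print(table.schema)
+    print(table.num_rows)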
+
+Attempting to use the `parquet` pip package (plus `python-snappy`), not the
+"official" Java `parquet-tools` command, to dump the shards out as... CSV?
+JSON?
+
+ # in a virtualenv/pipenv
+ pip install python-snappy
+ pip install parquet
+
+    # dump a single shard as JSON (shard path below is a placeholder)
+    parquet --format json citations_from_wikipedia/part-00000.parquet
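+
+As an alternative (a sketch, untested against this dataset; the shard
+filename is again a placeholder), pandas can read the snappy-compressed
+shards directly via pyarrow and dump them out:
+
+    # pip install pandas pyarrow
+    import pandas as pd
+
+    df = pd.read_parquet("citations_from_wikipedia/part-00000.parquet")
+    df.to_csv("part-00000.csv", index=False)
+    df.to_json("part-00000.jsonl", orient="records", lines=True)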
+
+## Final Metadata Fields
+
+For the final BiblioRef object:
+
+ _key: ("wikipedia", source_wikipedia_article, ref_index)
+ source_wikipedia_article: Optional[str]
+ with lang prefix like "en:Superglue"
+ source_year: Optional[int]
+ current year? or article created?
+
+ ref_index: int
+ 1-indexed, not 0-indexed
+ ref_key: Optional[str]
+ eg, "Lee86", "BIB23"
+
+ match_provenance: wikipedia
+
+    target_unstructured: string (only if no release_ident link/match)
+ target_csl: free-form JSON (only if no release_ident link/match)
+ CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+ generated from unstructured by a GROBID parse, if needed
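+
+A hypothetical dataclass sketch of the fields above (names and types are
+taken from these notes, not from any actual implementation):
+
+    from dataclasses import dataclass
+    from typing import Optional
+
+    @dataclass
+    class BiblioRef:
+        source_wikipedia_article: Optional[str]  # with lang prefix, eg "en:Superglue"
+        source_year: Optional[int]
+        ref_index: int  # 1-indexed, not 0-indexed
+        ref_key: Optional[str]  # eg "Lee86", "BIB23"
+        match_provenance: str = "wikipedia"
+        # only set when there is no release_ident link/match:
+        target_unstructured: Optional[str] = None
+        target_csl: Optional[dict] = None  # CSL-JSON, similar to ReleaseEntity
+
+        @property
+        def _key(self) -> tuple:
+            return ("wikipedia", self.source_wikipedia_article, self.ref_index)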
+