## Upstream Projects

There have been a few different research and infrastructure projects to extract references from Wikipedia articles.

"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)

- https://arxiv.org/abs/2007.07022
- https://github.com/Harshdeep1996/cite-classifications-wiki
- http://doi.org/10.5281/zenodo.3940692

> A total of 29.3M citations were extracted from 6.1M English Wikipedia
> articles as of May 2020, and classified as being to books, journal articles
> or Web contents. We were thus able to extract 4.0M citations to scholarly
> publications with known identifiers — including DOI, PMC, PMID, and ISBN

This project seems to aim for ongoing updates and for integration into other services (like OpenCitations). The dataset release is in parquet files. It includes some partial resolution of citations which lack identifiers, using the Crossref API.

"Citations with identifiers in Wikipedia" (~2018)

- https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
- https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1

This was a Wikimedia Foundation effort. It covers all language sites, which is great, but it is out of date (not ongoing), and IIRC it only includes works with a known PID (DOI, ISBN, etc.).

"Quantifying Engagement with Citations on Wikipedia" (2020)

"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)

- https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608

"'I Updated the `<ref>`': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)

Very sophisticated analysis of changes/edits to individual references over time, e.g. by tokenizing references and looking at the edit history. Probably not relevant for us, though they can show how old a reference is. Couldn't find an actual download location for the dataset.

## "Wikipedia Citations" Dataset

- `lookup_data.zip`: Crossref API objects; a single JSON file per DOI (many JSON files)
- `minimal_dataset.zip`: many parquet files (sharded), snappy-compressed; a subset of `citations_from_wikipedia.zip`
- `citations_from_wikipedia.zip`: many parquet files (sharded), snappy-compressed

Attempting to use `parquet-tools` pip packages (not the "official" `parquet-tools` command) to dump the files out as CSV or JSON:

```sh
# in a virtualenv/pipenv
pip install python-snappy
pip install parquet

# XXX
parquet --format json
```

(A `pyarrow`-based alternative is sketched at the end of these notes.)

## Final Metadata Fields

For the final BiblioRef object:

- `_key`: `("wikipedia", source_wikipedia_article, ref_index)`
- `source_wikipedia_article`: `Optional[str]`; with a language prefix, like "en:Superglue"
- `source_year`: `Optional[int]`; current year? or the year the article was created?
- `ref_index`: `int`; 1-indexed, not 0-indexed
- `ref_key`: `Optional[str]`; eg, "Lee86", "BIB23"
- `match_provenance`: "wikipedia"
- `get_unstructured`: string (only if no release_ident link/match)
- `target_csl`: free-form JSON (only if no release_ident link/match); CSL-JSON schema (similar to the ReleaseEntity schema, but not exactly); generated from the unstructured string by a GROBID parse, if needed
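
To make the field list above concrete, here is a minimal sketch as a Python dataclass. The class name, the exact types, and representing `_key` as a derived property are assumptions based on these notes, not the canonical BiblioRef schema:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass
class WikipediaBiblioRef:
    """Sketch of the BiblioRef fields above; names and types are assumptions."""

    source_wikipedia_article: Optional[str]  # lang-prefixed, eg "en:Superglue"
    ref_index: int                           # 1-indexed, not 0-indexed
    source_year: Optional[int] = None        # current year? or year article was created?
    ref_key: Optional[str] = None            # eg "Lee86", "BIB23"
    match_provenance: str = "wikipedia"
    # Only populated when there is no release_ident link/match:
    get_unstructured: Optional[str] = None
    target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON, GROBID-parsed if needed

    @property
    def _key(self) -> Tuple[str, Optional[str], int]:
        # ("wikipedia", source_wikipedia_article, ref_index)
        return ("wikipedia", self.source_wikipedia_article, self.ref_index)
```

Defaulting `match_provenance` to "wikipedia" reflects that every record produced by this pipeline shares the same provenance value.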
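
For the snappy-compressed parquet shards in `minimal_dataset.zip` / `citations_from_wikipedia.zip` described in the dataset section above, a `pyarrow`-based sketch (assuming `pip install pyarrow`; pyarrow ships with snappy codec support) for dumping one shard to CSV might look like:

```python
import sys

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Usage: python dump_shard.py <shard.parquet> <out.csv>
shard_path, csv_path = sys.argv[1], sys.argv[2]

# Read a single snappy-compressed parquet shard into an Arrow table.
table = pq.read_table(shard_path)

# Print the schema first, to see which columns are actually in this shard.
print(table.schema)

# Write the whole table back out as CSV.
pacsv.write_csv(table, csv_path)
```

The same `table` could also be converted with `table.to_pandas()` if pandas is installed and a DataFrame is more convenient for inspection.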