## Upstream Projects

There have been a few different research and infrastructure projects that
extract references from Wikipedia articles.
"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)
https://arxiv.org/abs/2007.07022
https://github.com/Harshdeep1996/cite-classifications-wiki
http://doi.org/10.5281/zenodo.3940692
> A total of 29.3M citations were extracted from 6.1M English Wikipedia
> articles as of May 2020, and classified as being to books, journal articles
> or Web contents. We were thus able to extract 4.0M citations to scholarly
> publications with known identifiers — including DOI, PMC, PMID, and ISBN
Seems to strive for being updated and getting integrated into other services
(like opencitations). Dataset release is in parquet files. Includes some
partial resolution of citations which lack identifiers, using the crossref API.
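
That kind of lookup presumably resembles the sketch below, which uses the
public Crossref `/works` endpoint with a `query.bibliographic` search. The
"take the first hit" behaviour here is purely illustrative; the project's
actual matching and scoring logic is not reproduced.

    # Illustrative only: resolve an unstructured citation string to a DOI
    # candidate via the public Crossref REST API. Error handling, rate
    # limiting, and score thresholds are left out.
    from typing import Optional

    import requests


    def crossref_candidate_doi(unstructured: str) -> Optional[str]:
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": unstructured, "rows": 1},
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json()["message"]["items"]
        return items[0]["DOI"] if items else None


    if __name__ == "__main__":
        print(crossref_candidate_doi(
            "Dunning, Accurate Methods for the Statistics of Surprise and Coincidence, 1993"))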
"Citations with identifiers in Wikipedia" (~2018)
https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1
This was a Wikimedia Foundation effort. Covers all language sites, which is
great, but is out of date (not ongoing), and IIRC only includes works with a
known PID (DOI, ISBN, etc).
"Quantifying Engagement with Citations on Wikipedia" (2020)
"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)
https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608
"'I Updated the <ref>': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)
Very sophisticated analysis of changes/edits to individual references over
time. Eg, by tokenizing and looking at edit history. Not relevant for us,
probably, though they can show how old a reference is. Couldn't find an actual
download location for the dataset.
## "Wikipedia Citations" Dataset
lookup_data.zip
Crossref API objects
single JSON file per DOI (many JSON files)
minimal_dataset.zip
many parquet files (sharded), snappy-compressed
subset of citations_from_wikipedia.zip
citations_from_wikipedia.zip
many parquet files (sharded), snappy-compressed
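
The per-DOI Crossref records in lookup_data.zip can be inspected without
extracting the archive; a small sketch follows. The member naming inside the
zip is an assumption, since only "one JSON file per DOI" is stated above.

    # Peek at the per-DOI Crossref API objects inside lookup_data.zip.
    # Assumes each member is a standalone JSON document; member names are not
    # documented here, so we just filter on a .json suffix.
    import json
    import zipfile

    with zipfile.ZipFile("lookup_data.zip") as zf:
        for name in zf.namelist()[:10]:  # first few entries only
            if not name.endswith(".json"):
                continue
            with zf.open(name) as fh:
                record = json.load(fh)
            print(name, record.get("DOI"), record.get("title"))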
Attempting to use Python parquet packages from pip (not the "official"
`parquet-tools` command) to dump the shards out as CSV or JSON:

    # in a virtualenv/pipenv
    pip install python-snappy
    pip install parquet

    # dump a single shard; pass the path to one of the .parquet files
    parquet --format json <shard>.parquet
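
Alternatively, the snappy-compressed shards can be read directly from Python
with pandas plus pyarrow. A rough sketch; the extraction directory name is an
assumption based on the zip file name above.

    # Read all parquet shards from an extracted citations_from_wikipedia.zip
    # into one DataFrame. Requires: pip install pandas pyarrow
    import glob

    import pandas as pd

    shards = sorted(glob.glob("citations_from_wikipedia/*.parquet"))
    df = pd.concat(
        (pd.read_parquet(path, engine="pyarrow") for path in shards),
        ignore_index=True,
    )
    print(len(df), "rows")
    print(df.columns.tolist())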

## Final Metadata Fields

For the final BiblioRef object (sketched as a dataclass after this list):
_key: ("wikipedia", source_wikipedia_article, ref_index)
source_wikipedia_article: Optional[str]
    with lang prefix, like "en:Superglue"
source_year: Optional[int]
    current year? or article created?
ref_index: int
    1-indexed, not 0-indexed
ref_key: Optional[str]
    eg, "Lee86", "BIB23"
match_provenance: wikipedia
target_unstructured: string (only if no release_ident link/match)
target_csl: free-form JSON (only if no release_ident link/match)
    CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
    generated from the unstructured string by a GROBID parse, if needed
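
As a concrete anchor, here is a minimal sketch of those fields as a Python
dataclass. Only the field names and notes above come from this document; the
class name, defaults, and key helper are assumptions, not the project's actual
schema code.

    # Sketch of the BiblioRef fields for Wikipedia-sourced references. Not the
    # real schema definition; field names/semantics follow the list above.
    from dataclasses import dataclass
    from typing import Any, Dict, Optional, Tuple


    @dataclass
    class WikipediaBiblioRef:
        source_wikipedia_article: Optional[str]      # with lang prefix, eg "en:Superglue"
        ref_index: int                               # 1-indexed, not 0-indexed
        source_year: Optional[int] = None            # current year? or article created?
        ref_key: Optional[str] = None                # eg "Lee86", "BIB23"
        match_provenance: str = "wikipedia"
        target_unstructured: Optional[str] = None    # only if no release_ident link/match
        target_csl: Optional[Dict[str, Any]] = None  # CSL-JSON, possibly from a GROBID parse

        def key(self) -> Tuple[str, Optional[str], int]:
            # _key: ("wikipedia", source_wikipedia_article, ref_index)
            return ("wikipedia", self.source_wikipedia_article, self.ref_index)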