## Upstream Projects

There have been a few different research and infrastructure projects to extract
references from Wikipedia articles.

"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)
https://arxiv.org/abs/2007.07022
https://github.com/Harshdeep1996/cite-classifications-wiki
http://doi.org/10.5281/zenodo.3940692

> A total of 29.3M citations were extracted from 6.1M English Wikipedia
> articles as of May 2020, and classified as being to books, journal articles
> or Web contents.  We were thus able to extract 4.0M citations to scholarly
> publications with known identifiers — including DOI, PMC, PMID, and ISBN

Seems to aim at staying updated and getting integrated into other services
(like OpenCitations). The dataset is released as parquet files. Includes some
partial resolution of citations which lack identifiers, using the Crossref API.
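
That Crossref resolution step is basically a bibliographic query against the
public REST API. A minimal sketch of that kind of lookup (the helper name and
example citation string are made up, not taken from their code):

    import requests

    def crossref_lookup(unstructured: str) -> dict:
        """Query the Crossref /works endpoint with a free-form citation string."""
        resp = requests.get(
            "https://api.crossref.org/works",
            params={"query.bibliographic": unstructured, "rows": 1},
            timeout=30,
        )
        resp.raise_for_status()
        items = resp.json()["message"]["items"]
        return items[0] if items else {}

    # e.g., resolve a citation with no identifier to a candidate Crossref record
    hit = crossref_lookup("Lee, B. (1986). Some paper title. Journal of Examples.")
    print(hit.get("DOI"))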

"Citations with identifiers in Wikipedia" (~2018)
https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1

This was a Wikimedia Foundation effort. Covers all language editions, which is
great, but it is out of date (not ongoing), and IIRC it only includes works
with a known PID (DOI, ISBN, etc.).

"Quantifying Engagement with Citations on Wikipedia" (2020)

"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)
https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608

"'I Updated the <ref>': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)

Very sophisticated analysis of changes/edits to individual references over
time, e.g. by tokenizing references and looking at the edit history. Probably
not relevant for us, though it can show how old a reference is. Couldn't find
an actual download location for the dataset.

## "Wikipedia Citations" Dataset

    lookup_data.zip
        Crossref API objects
        single JSON file per DOI (many JSON files; see the snippet after this list)

    minimal_dataset.zip
        many parquet files (sharded), snappy-compressed
        subset of citations_from_wikipedia.zip

    citations_from_wikipedia.zip
        many parquet files (sharded), snappy-compressed
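
A quick way to peek at the per-DOI Crossref objects in `lookup_data.zip`
without unpacking the whole archive (the path and internal layout here are
assumptions based on the description above):

    import json
    import zipfile

    with zipfile.ZipFile("lookup_data.zip") as zf:
        for name in zf.namelist():
            if not name.endswith(".json"):
                continue
            with zf.open(name) as f:
                record = json.load(f)
            # each record should be a Crossref API work object
            print(name, record.get("DOI"))
            break  # just inspect the first one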

Attempting to use pip-installable parquet packages (not the "official" Java
`parquet-tools` command) to dump the shards out as CSV or JSON:

    # in a virtualenv/pipenv
    pip install python-snappy
    pip install parquet

    # dump one shard as JSON; the shard filename here is a placeholder
    parquet --format json part-00000.parquet
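
If the CLI route is flaky, reading a snappy-compressed shard directly with
`pyarrow` (or `pandas.read_parquet`) also works; the shard filename below is
again just a placeholder:

    import pyarrow.parquet as pq

    table = pq.read_table("citations_from_wikipedia/part-00000.parquet")
    print(table.schema)      # inspect column names/types
    df = table.to_pandas()   # convert to a DataFrame for exploration
    print(df.head())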

## Final Metadata Fields

For the final BiblioRef object (sketched as a dataclass after this list):

    _key: ("wikipedia", source_wikipedia_article, ref_index)
    source_wikipedia_article: Optional[str]
        with lang prefix like "en:Superglue"
    source_year: Optional[int]
        current year? or article created?

    ref_index: int
        1-indexed, not 0-indexed
    ref_key: Optional[str]
        eg, "Lee86", "BIB23"

    match_provenance: wikipedia

    target_unstructured: string (only if no release_ident link/match)
    target_csl: free-form JSON (only if no release_ident link/match)
        CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
        generated from unstructured by a GROBID parse, if needed
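
A rough dataclass sketch of the fields above, just for illustration (the real
BiblioRef schema lives elsewhere; the class name and types here are guesses
from these notes):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class WikipediaBiblioRef:
        # _key is ("wikipedia", source_wikipedia_article, ref_index)
        source_wikipedia_article: Optional[str]  # lang-prefixed, e.g. "en:Superglue"
        source_year: Optional[int]               # current year, or article creation year?
        ref_index: int                           # 1-indexed, not 0-indexed
        ref_key: Optional[str] = None            # e.g. "Lee86", "BIB23"
        match_provenance: str = "wikipedia"
        # only populated when there is no release_ident link/match:
        target_unstructured: Optional[str] = None
        target_csl: Optional[dict] = None        # CSL-JSON, possibly from a GROBID parse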