This document describes how to parse references out of Wikipedia bulk XML
dumps, using the `wikiciteparser` Python package, for use in the refcat
citation matching pipeline.

Unfortunately, due to limitations in `wikiciteparser` (and the complexity of
Wikipedia citation formatting across language instances), this pipeline only
works with the English version of Wikipedia (enwiki).


## Download Bulk XML Snapshot

You can find documentation and links to recent snapshots at
<https://dumps.wikimedia.org/backup-index.html>. We want the
`-pages-articles.xml.bz2` files, which include article text for the most
recent version of each article. If we download the set of smaller individual
files, instead of the single combined file, we can parallelize processing later.

A hacky way to download all the files is to copy/paste the list of URLs from
the web listing into a file called `urls.txt`, then run a command like:

    cat urls.txt | parallel -j2 wget --quiet -c https://dumps.wikimedia.org/enwiki/20211201/{}
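
A slightly less manual alternative is to scrape the listing page for the
individual part names. The sketch below assumes the listing layout and file
naming shown above; the regular expression may need tweaking for other
snapshots:

    # Sketch: write the individual -pages-articles part names to urls.txt
    import re
    import urllib.request

    DUMP_DATE = "20211201"  # snapshot date used in the examples above
    INDEX_URL = f"https://dumps.wikimedia.org/enwiki/{DUMP_DATE}/"

    html = urllib.request.urlopen(INDEX_URL).read().decode("utf-8")

    # Match the numbered -pages-articles parts, skipping the single combined
    # dump and the -multistream variants.
    pattern = re.compile(r'href="([^"]*pages-articles\d[^"]*\.bz2)"')
    names = sorted({m.group(1).split("/")[-1] for m in pattern.finditer(html)})

    with open("urls.txt", "w") as f:
        for name in names:
            print(name, file=f)

    print(f"wrote {len(names)} file names to urls.txt")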


## Install `wikiciteparser`

To use the official/released version, in a virtualenv (or similar), run:

    pip install wikiciteparser

Or, do a git checkout of <https://github.com/dissemin/wikiciteparser>.
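
As a quick sanity check, it should be possible to parse a single citation
template directly. This is only a sketch: the `parse_citation_template`
entry point and its dict return value are assumptions based on the upstream
`wikiciteparser` README, and `mwparserfromhell` (used here to extract the
template) may need to be installed separately:

    # Parse one {{cite journal}} template and print the normalized fields.
    import mwparserfromhell
    from wikiciteparser.parser import parse_citation_template  # assumed entry point

    wikitext = "{{cite journal |title=An Example |journal=Example Journal |year=2020}}"
    wikicode = mwparserfromhell.parse(wikitext)

    for template in wikicode.filter_templates():
        parsed = parse_citation_template(template)
        print(parsed)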


## Run Parallel Command

Within a virtualenv, use `parallel` to process the downloaded files:

    ls /fast/download/enwiki-20211201-pages-articles/enwiki*.bz2 \
        | parallel -j12 --line-buffer python3 -m wikiciteparser.bulk {} \
        | pv -l \
        | gzip \
        > enwiki-20211201-pages-articles.citations.json.gz

This will output JSON lines, one line per article, containing the article
title, revision, site name, and any extracted references as an array of JSON
objects.
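
The output can then be consumed line by line. The sketch below assumes a
`refs` array per article; the exact field names are not documented here, so
inspect one line of the actual output to confirm the schema:

    # Count articles and extracted references in the bulk output.
    import gzip
    import json

    articles = 0
    refs = 0
    with gzip.open("enwiki-20211201-pages-articles.citations.json.gz", "rt") as f:
        for line in f:
            row = json.loads(line)
            articles += 1
            # "refs" is a placeholder key name for the reference sub-array
            refs += len(row.get("refs", []))

    print(f"{articles} articles, {refs} extracted references")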

## Prior Work

Similar projects include:

* [Harshdeep1996/cite-classifications-wiki](https://github.com/Harshdeep1996/cite-classifications-wiki):
  uses `wikiciteparser` and PySpark to extract references from bulk XML,
  outputs parquet. Requires a Spark cluster/environment to run. (itself used by
  [Wikipedia Citations in Wikidata](https://github.com/opencitations/wcw))
* [python-mwcites](https://github.com/mediawiki-utilities/python-mwcites): uses
  `python-mwxml` to iterate over bulk XML, has relatively simple identifier
  extraction
* [gesiscss/wikipedia_references](https://github.com/gesiscss/wikipedia_references):
  oriented towards tracking edits to individual references over time