This document describes how to parse references out of Wikipedia bulk XML
dumps, using the `wikiciteparser` Python package, for use in the refcat
citation matching pipeline.
Unfortunately, due to limitations in `wikiciteparser` (and the complexity of
Wikipedia citation formatting across language instances), this pipeline only
works with the English version of Wikipedia (enwiki).
## Download Bulk XML Snapshot
You can find documentation and links to recent snapshots at
<https://dumps.wikimedia.org/backup-index.html>. We want the
`-pages-articles.xml.bz2` files, which include article text for the most
recent revision of each article. If we download the set of smaller individual
files instead of the single combined file, we can parallelize processing later.
A hack-y way to download all the files is to copy/paste the list of URLs from
the web listing, put them in a file called `urls.txt`, then run a command like:
```sh
cat urls.txt | parallel -j2 wget --quiet -c https://dumps.wikimedia.org/enwiki/20211201/{}
```
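Alternatively, a short script can build `urls.txt` from the directory listing.
This is only a sketch: it assumes the dump directory at
`https://dumps.wikimedia.org/enwiki/20211201/` is served as a plain HTML index
with `href` links to the split `-pages-articles` files, and the file naming
pattern may differ for other dump dates.

```python
# Sketch: write the split -pages-articles .bz2 file names into urls.txt.
# Assumes the dump directory is a plain HTML index with href="..." links;
# verify against the actual listing before relying on this.
import re
import urllib.request

LISTING_URL = "https://dumps.wikimedia.org/enwiki/20211201/"

html = urllib.request.urlopen(LISTING_URL).read().decode("utf-8")

# Split files look like enwiki-20211201-pages-articles1.xml-p1p41242.bz2;
# requiring a digit right after "pages-articles" skips the single
# combined -pages-articles.xml.bz2 file and the -multistream variants.
pattern = r'href="(enwiki-20211201-pages-articles\d[^"]*\.bz2)"'
names = sorted(set(re.findall(pattern, html)))

with open("urls.txt", "w") as out:
    for name in names:
        out.write(name + "\n")

print(f"wrote {len(names)} file names to urls.txt")
```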
## Install `wikiciteparser`
To use the official/released version, in a virtualenv (or similar), run:
```sh
pip install wikiciteparser
```
Or, do a git checkout of <https://github.com/dissemin/wikiciteparser>.
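As a quick check that the install works, you can parse a single citation
template out of raw wikitext. This is a minimal sketch based on the usage shown
in the upstream README; it additionally assumes `mwparserfromhell` is installed
in the same virtualenv.

```python
# Sketch: parse one {{cite ...}} template with wikiciteparser.
# Assumes `pip install mwparserfromhell` alongside wikiciteparser.
import mwparserfromhell
from wikiciteparser.parser import parse_citation_template

wikitext = (
    "{{cite journal |last=Doe |first=Jane |year=2020 "
    "|title=An Example Article |journal=Example Journal "
    "|doi=10.1000/xyz123}}"
)

wikicode = mwparserfromhell.parse(wikitext)
for template in wikicode.filter_templates():
    parsed = parse_citation_template(template)
    # Recognized citation templates come back as a dict of normalized
    # fields; templates the parser does not handle come back falsy.
    if parsed:
        print(parsed)
```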
## Run Parallel Command
Within the virtualenv, use `parallel` to process all of the downloaded files:
```sh
ls /fast/download/enwiki-20211201-pages-articles/enwiki*.bz2 \
    | parallel -j12 --line-buffer python3 -m wikiciteparser.bulk {} \
    | pv -l \
    | gzip \
    > enwiki-20211201-pages-articles.citations.json.gz
```
This will output JSON lines, one line per article, with the article title,
revision, site name, and any extracted references in a sub-array (of JSON
objects). As of December 2021, the above command takes about 17 hours on a
large machine.
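Downstream code can stream this output directly from the gzip file, one JSON
document per line. A rough sketch of counting extracted references follows; the
key name `refs` is an illustrative assumption, so inspect a line of actual
output to confirm the schema before using it.

```python
# Sketch: stream the gzipped JSON-lines output and tally references.
# The "refs" key name is an assumption about the schema; check one
# line of real output and adjust as needed.
import gzip
import json

path = "enwiki-20211201-pages-articles.citations.json.gz"

articles = 0
refs_total = 0
with gzip.open(path, mode="rt", encoding="utf-8") as handle:
    for line in handle:
        doc = json.loads(line)
        articles += 1
        refs_total += len(doc.get("refs") or [])

print(f"{articles} articles, {refs_total} extracted references")
```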
## Prior Work
Similar projects include:
* [Harshdeep1996/cite-classifications-wiki](https://github.com/Harshdeep1996/cite-classifications-wiki):
uses `wikiciteparser` and PySpark to extract references from bulk XML and
outputs Parquet. Requires a Spark cluster/environment to run. (Itself used by
[Wikipedia Citations in Wikidata](https://github.com/opencitations/wcw).)
* [python-mwcites](https://github.com/mediawiki-utilities/python-mwcites): uses
`python-mwxml` to iterate over bulk XML, has relatively simple identifier
extraction
* [gesiscss/wikipedia_references](https://github.com/gesiscss/wikipedia_references):
oriented towards tracking edits to individual references over time