This document describes how to parse references out of Wikipedia bulk XML dumps, using the `wikiciteparser` Python package, for use in the refcat citation matching pipeline. Unfortunately, due to limitations in `wikiciteparser` (and the complexity of Wikipedia citation formatting across language instances), this pipeline only works with the English version of Wikipedia (enwiki).

## Download Bulk XML Snapshot

You can find documentation and links to recent snapshots at https://dumps.wikimedia.org/enwiki/. We want the `-pages-articles.xml.bz2` files, which include article text for the most recent revision of each article. If we download the set of smaller individual files, instead of the single combined file, we can parallelize processing later.

A hack-y way to download all the files is to copy/paste the list of URLs from the web listing into a file called `urls.txt`, then run a command like:

    cat urls.txt | parallel -j2 wget --quiet -c https://dumps.wikimedia.org/enwiki/20211201/{}

## Install `wikiciteparser`

To use the official/released version, in a virtualenv (or similar), run:

    pip install wikiciteparser

Or, do a git checkout of the source repository.

## Run Parallel Command

Within a virtualenv, use `parallel` to process the dump files like:

    ls /fast/download/enwiki-20211201-pages-articles/enwiki*.bz2 \
        | parallel -j12 --line-buffer python3 -m wikiciteparser.bulk {} \
        | pv -l \
        | gzip \
        > enwiki-20211201-pages-articles.citations.json.gz

This will output JSON lines, one line per article, with the article title, revision, site name, and any extracted references in a sub-array (of JSON objects); see the sketch at the end of this document for one way to consume this output. As of December 2021, this takes about 17 hours on a large machine with the above command.

## Prior Work

Similar projects include:

* [Harshdeep1996/cite-classifications-wiki](https://github.com/Harshdeep1996/cite-classifications-wiki): uses `wikiciteparser` and PySpark to extract references from bulk XML, and outputs parquet. Requires a Spark cluster/environment to run. (Itself used by [Wikipedia Citations in Wikidata](https://github.com/opencitations/wcw).)
* [python-mwcites](https://github.com/mediawiki-utilities/python-mwcites): uses `python-mwxml` to iterate over bulk XML; has relatively simple identifier extraction
* [gesiscss/wikipedia_references](https://github.com/gesiscss/wikipedia_references): oriented towards tracking edits to individual references over time
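Though not needed for the bulk pipeline above, it can help to see what `wikiciteparser` extracts from a single citation template. The sketch below follows the usage pattern from the package's README, but the import path, function name, and output keys are assumptions here and should be checked against the repository before use:

    # Sketch of calling wikiciteparser directly on one citation template.
    # NOTE: parse_citation_template and its return format are assumptions
    # based on the package README; verify against the actual source.
    import mwparserfromhell
    from wikiciteparser.parser import parse_citation_template

    wikitext = (
        "{{cite journal |last=Smith |first=Jane |title=An Example Article "
        "|journal=Example Journal |year=2020 |doi=10.1000/xyz123}}"
    )

    # mwparserfromhell turns raw wikitext into nodes; citation templates
    # show up as Template objects.
    for template in mwparserfromhell.parse(wikitext).filter_templates():
        parsed = parse_citation_template(template)
        # Expected to return a dict for recognized citation templates;
        # identifiers (DOI, PMID, ...) are typically nested under an
        # "ID_list"-style key, but inspect the real output.
        if parsed:
            print(parsed)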
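For orientation, here is a minimal sketch of reading the compressed JSON-lines output from the run above and tallying references that carry a DOI. The key names used (`refs`, `ID_list`, `DOI`) are assumptions, not a documented schema; inspect a line of the real output and adjust before relying on them:

    # Minimal sketch of consuming the bulk output file produced above.
    # The per-article and per-reference key names ("refs", "ID_list", "DOI")
    # are assumptions -- check an actual output line and adjust.
    import gzip
    import json

    article_count = 0
    doi_count = 0

    with gzip.open("enwiki-20211201-pages-articles.citations.json.gz", "rt") as f:
        for line in f:
            article = json.loads(line)
            article_count += 1
            for ref in article.get("refs", []):
                # Extracted identifiers are expected in a nested mapping;
                # skip references without one.
                id_list = ref.get("ID_list") or {}
                if id_list.get("DOI"):
                    doi_count += 1

    print(f"articles: {article_count}, references with a DOI: {doi_count}")

Because the output is plain JSON lines, the same kind of counting can also be done with `zcat` and `jq` once the actual field names are known.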