author    | Martin Czygan <martin@archive.org> | 2021-12-18 00:18:46 +0000
committer | Martin Czygan <martin@archive.org> | 2021-12-18 00:18:46 +0000
commit    | 3c0cae2b81dbd4ff7621cf9b7e4a6183352984f0 (patch)
tree      | d6289ab22ead66eb28190ff07e00af6ab0f35306 /extra/wikipedia/README.md
parent    | 3867fcab91244650a1e2fd9bba165a54c4e810e5 (diff)
parent    | fa557a90482cfed59564173e442d9375b959ee8b (diff)
Merge branch 'bnewbold-wikipedia-notes' into 'master'
wikipedia refs prep notes, and stats from 20210801 run
See merge request webgroup/refcat!5
Diffstat (limited to 'extra/wikipedia/README.md')
-rw-r--r-- | extra/wikipedia/README.md | 61
1 file changed, 61 insertions, 0 deletions
diff --git a/extra/wikipedia/README.md b/extra/wikipedia/README.md
new file mode 100644
index 0000000..59480a7
--- /dev/null
+++ b/extra/wikipedia/README.md
@@ -0,0 +1,61 @@

This document describes how to parse references out of Wikipedia bulk XML
dumps, using the `wikiciteparser` python package, for use in the refcat
citation matching pipeline.

Unfortunately, due to limitations in `wikiciteparser` (and the complexity of
Wikipedia citation formatting across language instances), this pipeline only
works with the English version of Wikipedia (enwiki).


## Download Bulk XML Snapshot

You can find documentation and links to recent snapshots at
<https://dumps.wikimedia.org/backup-index.html>. We want the
`-pages-articles.xml.bz2` files, which include article text for the most
recent version of articles. If we download the set of smaller individual
files, instead of the single combined file, we can parallelize processing
later.

A hack-y way to download all the files is to copy/paste the list of URLs from
the web listing, put them in a file called `urls.txt`, then run a command
like:

    cat urls.txt | parallel -j2 wget --quiet -c https://dumps.wikimedia.org/enwiki/20211201/{}

(A sketch for generating `urls.txt` programmatically is included at the end
of this document.)


## Install `wikiciteparser`

To use the official/released version, in a virtualenv (or similar), run:

    pip install wikiciteparser

Or, do a git checkout of <https://github.com/dissemin/wikiciteparser>.


## Run Parallel Command

Within a virtualenv, use `parallel` to process the dump files like so:

    ls /fast/download/enwiki-20211201-pages-articles/enwiki*.bz2 \
        | parallel -j12 --line-buffer python3 -m wikiciteparser.bulk {} \
        | pv -l \
        | gzip \
        > enwiki-20211201-pages-articles.citations.json.gz

This will output JSON lines, one line per article, with the article title,
revision, site name, and any extracted references in a sub-array (of JSON
objects); a short sketch for inspecting this output is included at the end of
this document. As of December 2021, the above command takes about 17 hours on
a large machine.


## Prior Work

Similar projects include:

* [Harshdeep1996/cite-classifications-wiki](https://github.com/Harshdeep1996/cite-classifications-wiki):
  uses `wikiciteparser` and PySpark to extract references from bulk XML,
  outputs parquet. Requires a Spark cluster/environment to run. (Itself used
  by [Wikipedia Citations in Wikidata](https://github.com/opencitations/wcw).)
* [python-mwcites](https://github.com/mediawiki-utilities/python-mwcites): uses
  `python-mwxml` to iterate over bulk XML, with relatively simple identifier
  extraction.
* [gesiscss/wikipedia_references](https://github.com/gesiscss/wikipedia_references):
  oriented towards tracking edits to individual references over time.
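
## Helper Sketches

The first sketch below is a minimal, optional way to build `urls.txt` without
copy/pasting from the web listing, as mentioned in the download section above.
The listing URL, the filename pattern, and the output filename are assumptions
taken from the download command shown earlier; adjust the dump date and
pattern for other runs.

    #!/usr/bin/env python3
    # Sketch: write the split -pages-articles part filenames into urls.txt.
    # Assumes the standard dumps.wikimedia.org HTML listing for the enwiki
    # 20211201 run; the date and pattern below are illustrative.
    import re
    import urllib.request

    BASE = "https://dumps.wikimedia.org/enwiki/20211201/"

    # Fetch the plain HTML directory listing for this dump run.
    html = urllib.request.urlopen(BASE).read().decode("utf-8")

    # Match split article parts, e.g.
    # enwiki-20211201-pages-articles1.xml-p1p41242.bz2, while skipping the
    # single combined file and the multistream variants.
    pattern = r'href="(enwiki-20211201-pages-articles\d+\.xml[^"]*\.bz2)"'
    names = sorted(set(re.findall(pattern, html)))

    with open("urls.txt", "w") as fh:
        for name in names:
            fh.write(name + "\n")

    print(f"wrote {len(names)} filenames to urls.txt")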
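
The second sketch streams the gzipped JSON lines produced by the parallel
command above and prints a rough count of articles and extracted references.
The key name `refs` is an assumption about the `wikiciteparser.bulk` output
based on the description above; the first-record field listing makes it easy
to check and adjust.

    #!/usr/bin/env python3
    # Sketch: inspect the citation extraction output (gzipped JSON lines).
    # The "refs" key is an assumption; the first record's fields are printed
    # so the key names can be verified against the real output.
    import gzip
    import json

    path = "enwiki-20211201-pages-articles.citations.json.gz"

    articles, refs = 0, 0
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            doc = json.loads(line)
            articles += 1
            refs += len(doc.get("refs") or [])
            if i == 0:
                # Show the available top-level fields of the first record.
                print("fields:", sorted(doc.keys()))

    print(f"{articles} articles, {refs} extracted references")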