author    | Martin Czygan <martin@archive.org> | 2021-12-18 00:18:46 +0000
committer | Martin Czygan <martin@archive.org> | 2021-12-18 00:18:46 +0000
commit    | 3c0cae2b81dbd4ff7621cf9b7e4a6183352984f0 (patch)
tree      | d6289ab22ead66eb28190ff07e00af6ab0f35306 /extra/wikipedia/README.md
parent    | 3867fcab91244650a1e2fd9bba165a54c4e810e5 (diff)
parent    | fa557a90482cfed59564173e442d9375b959ee8b (diff)
Merge branch 'bnewbold-wikipedia-notes' into 'master'
wikipedia refs prep notes, and stats from 20210801 run
See merge request webgroup/refcat!5
Diffstat (limited to 'extra/wikipedia/README.md')
-rw-r--r-- | extra/wikipedia/README.md | 61
1 file changed, 61 insertions, 0 deletions
diff --git a/extra/wikipedia/README.md b/extra/wikipedia/README.md
new file mode 100644
index 0000000..59480a7
--- /dev/null
+++ b/extra/wikipedia/README.md
@@ -0,0 +1,61 @@

This document describes how to parse references out of Wikipedia bulk XML
dumps, using the `wikiciteparser` python package, for use in the refcat
citation matching pipeline.

Unfortunately, due to limitations in `wikiciteparser` (and the complexity of
Wikipedia citation formatting across language instances), this pipeline only
works with the English version of Wikipedia (enwiki).


## Download Bulk XML Snapshot

You can find documentation and links to recent snapshots at
<https://dumps.wikimedia.org/backup-index.html>. We want the
`-pages-articles.xml.bz2` files, which include article text for the most
recent version of articles. If we download the set of smaller individual
files, instead of the single combined file, we can parallelize processing
later.

A hack-y way to download all the files is to copy/paste the list of URLs from
the web listing, put them in a file called `urls.txt`, then run a command
like:

    cat urls.txt | parallel -j2 wget --quiet -c https://dumps.wikimedia.org/enwiki/20211201/{}

(A sketch for generating `urls.txt` programmatically is included at the end
of this document.)


## Install `wikiciteparser`

To use the official/released version, in a virtualenv (or similar), run:

    pip install wikiciteparser

Or, do a git checkout of <https://github.com/dissemin/wikiciteparser>.


## Run Parallel Command

Within a virtualenv, use `parallel` to process the dump files like so:

    ls /fast/download/enwiki-20211201-pages-articles/enwiki*.bz2 \
        | parallel -j12 --line-buffer python3 -m wikiciteparser.bulk {} \
        | pv -l \
        | gzip \
        > enwiki-20211201-pages-articles.citations.json.gz

This will output JSON lines, one line per article, with the article title,
revision, site name, and any extracted references in a sub-array (of JSON
objects); a short sketch for inspecting this output is included at the end of
this document. As of December 2021, the above command takes about 17 hours on
a large machine.


## Prior Work

Similar projects include:

* [Harshdeep1996/cite-classifications-wiki](https://github.com/Harshdeep1996/cite-classifications-wiki):
  uses `wikiciteparser` and PySpark to extract references from bulk XML,
  outputs parquet. Requires a Spark cluster/environment to run. (Itself used
  by [Wikipedia Citations in Wikidata](https://github.com/opencitations/wcw).)
* [python-mwcites](https://github.com/mediawiki-utilities/python-mwcites): uses
  `python-mwxml` to iterate over bulk XML, with relatively simple identifier
  extraction.
* [gesiscss/wikipedia_references](https://github.com/gesiscss/wikipedia_references):
  oriented towards tracking edits to individual references over time.
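
## Helper Sketches

The first sketch below is a minimal, optional way to build `urls.txt` without
copy/pasting from the web listing, as mentioned in the download section above.
The listing URL, the filename pattern, and the output filename are assumptions
taken from the download command shown earlier; adjust the dump date and
pattern for other runs.

    #!/usr/bin/env python3
    # Sketch: write the split -pages-articles part filenames into urls.txt.
    # Assumes the standard dumps.wikimedia.org HTML listing for the enwiki
    # 20211201 run; the date and pattern below are illustrative.
    import re
    import urllib.request

    BASE = "https://dumps.wikimedia.org/enwiki/20211201/"

    # Fetch the plain HTML directory listing for this dump run.
    html = urllib.request.urlopen(BASE).read().decode("utf-8")

    # Match split article parts, e.g.
    # enwiki-20211201-pages-articles1.xml-p1p41242.bz2, while skipping the
    # single combined file and the multistream variants.
    pattern = r'href="(enwiki-20211201-pages-articles\d+\.xml[^"]*\.bz2)"'
    names = sorted(set(re.findall(pattern, html)))

    with open("urls.txt", "w") as fh:
        for name in names:
            fh.write(name + "\n")

    print(f"wrote {len(names)} filenames to urls.txt")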
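
The second sketch streams the gzipped JSON lines produced by the parallel
command above and prints a rough count of articles and extracted references.
The key name `refs` is an assumption about the `wikiciteparser.bulk` output
based on the description above; the first-record field listing makes it easy
to check and adjust.

    #!/usr/bin/env python3
    # Sketch: inspect the citation extraction output (gzipped JSON lines).
    # The "refs" key is an assumption; the first record's fields are printed
    # so the key names can be verified against the real output.
    import gzip
    import json

    path = "enwiki-20211201-pages-articles.citations.json.gz"

    articles, refs = 0, 0
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            doc = json.loads(line)
            articles += 1
            refs += len(doc.get("refs") or [])
            if i == 0:
                # Show the available top-level fields of the first record.
                print("fields:", sorted(doc.keys()))

    print(f"{articles} articles, {refs} extracted references")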