author     Martin Czygan <martin.czygan@gmail.com>  2021-03-21 01:39:13 +0100
committer  Martin Czygan <martin.czygan@gmail.com>  2021-03-21 01:39:13 +0100
commit     1eae78c37dcb605c369d977f4ad764694603641b (patch)
tree       45c6f500c8d004f341939b2c027fba30147ed7cd
parent     6af4de12553fe1fdbb1e08342df0a84052e985cb (diff)
add ol and wikipedia notes
-rw-r--r--  python/notes/openlibrary_works.md     27
-rw-r--r--  python/notes/wikipedia_references.md  85
2 files changed, 112 insertions, 0 deletions
diff --git a/python/notes/openlibrary_works.md b/python/notes/openlibrary_works.md
new file mode 100644
index 0000000..25df527
--- /dev/null
+++ b/python/notes/openlibrary_works.md
@@ -0,0 +1,27 @@
+
+## Upstream Dumps
+
+Open Library does monthly bulk dumps: <https://archive.org/details/ol_exports?sort=-publicdate>
+
+Latest work dump: <https://openlibrary.org/data/ol_dump_works_latest.txt.gz>
+
+TSV columns:
+
+ type - type of record (/type/edition, /type/work etc.)
+ key - unique key of the record. (/books/OL1M etc.)
+ revision - revision number of the record
+ last_modified - last modified timestamp
+ JSON - the complete record in JSON format
+
+ zcat ol_dump_works_latest.txt.gz | cut -f5 | head | jq .
+
+We are going to want (see the extraction sketch after this list):
+
+- title (with "prefix"?)
+- authors
+- subtitle
+- year
+- identifier (work? edition?)
+- isbn-13 (if available)
+- borrowable or not
+
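+A rough, untested extraction sketch over this dump, assuming the TSV layout
+above (JSON in column 5) and guessing at work-record field names ("title",
+"subtitle", "authors"); year, ISBN-13 and lending status probably need the
+editions dump and are skipped here.
+
+    import gzip
+    import json
+    import sys
+
+    def extract(work):
+        # "authors" entries are assumed to be {"author": {"key": ...}} refs
+        # (sometimes just a key string); resolving keys to names would need
+        # the authors dump as well.
+        authors = []
+        for a in work.get("authors", []):
+            ref = a.get("author") if isinstance(a, dict) else None
+            if isinstance(ref, dict):
+                ref = ref.get("key")
+            if ref:
+                authors.append(ref)
+        return {
+            "key": work.get("key"),          # e.g. "/works/OL45883W"
+            "title": work.get("title"),
+            "subtitle": work.get("subtitle"),
+            "authors": authors,
+        }
+
+    if __name__ == "__main__":
+        # usage: python sketch.py ol_dump_works_latest.txt.gz | head
+        with gzip.open(sys.argv[1], "rt") as f:
+            for line in f:
+                # TSV columns: type, key, revision, last_modified, JSON
+                record = json.loads(line.rstrip("\n").split("\t")[4])
+                print(json.dumps(extract(record)))
+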
diff --git a/python/notes/wikipedia_references.md b/python/notes/wikipedia_references.md
new file mode 100644
index 0000000..e4c9b4c
--- /dev/null
+++ b/python/notes/wikipedia_references.md
@@ -0,0 +1,85 @@
+
+## Upstream Projects
+
+There have been a few different research and infrastructure projects to extract
+references from Wikipedia articles.
+
+"Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia" (2020)
+https://arxiv.org/abs/2007.07022
+https://github.com/Harshdeep1996/cite-classifications-wiki
+http://doi.org/10.5281/zenodo.3940692
+
+> A total of 29.3M citations were extracted from 6.1M English Wikipedia
+> articles as of May 2020, and classified as being to books, journal articles
+> or Web contents. We were thus able to extract 4.0M citations to scholarly
+> publications with known identifiers — including DOI, PMC, PMID, and ISBN
+
+This project seems to aim for ongoing updates and for integration into other
+services (like OpenCitations). The dataset is released as parquet files and
+includes partial resolution of citations lacking identifiers via the Crossref API.
+
+"Citations with identifiers in Wikipedia" (~2018)
+https://analytics.wikimedia.org/published/datasets/archive/public-datasets/all/mwrefs/mwcites-20180301/
+https://figshare.com/articles/dataset/Citations_with_identifiers_in_Wikipedia/1299540/1
+
+This was a Wikimedia Foundation effort. Covers all language sites, which is
+great, but is out of date (not ongoing), and IIRC only includes works with a
+known PID (DOI, ISBN, etc).
+
+"Quantifying Engagement with Citations on Wikipedia" (2020)
+
+"Measuring the quality of scientific references in Wikipedia: an analysis of more than 115M citations to over 800 000 scientific articles" (2020)
+https://febs.onlinelibrary.wiley.com/doi/abs/10.1111/febs.15608
+
+"'I Updated the <ref>': The Evolution of References in the English Wikipedia and the Implications for Altmetrics" (2020)
+
+A very sophisticated analysis of changes/edits to individual references over
+time, e.g. by tokenizing references and looking at edit history. Probably not
+relevant for us, though the approach can show how old a reference is. Couldn't
+find an actual download location for the dataset.
+
+## "Wikipedia Citations" Dataset
+
+ lookup_data.zip
+ Crossref API objects
+ single JSON file per DOI (many JSON files)
+
+ minimal_dataset.zip
+ many parquet files (sharded), snappy-compressed
+ subset of citations_from_wikipedia.zip
+
+ citations_from_wikipedia.zip
+ many parquet files (sharded), snappy-compressed
+
+Attempting to use the `parquet` pip package (not the "official" Java
+`parquet-tools` CLI) to dump shards out as CSV or JSON:
+
+ # in a virtualenv/pipenv
+ pip install python-snappy
+ pip install parquet
+
+    # dump one shard as JSON (the shard filename below is just an example)
+    parquet --format json part-00000.snappy.parquet
+
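+If the `parquet` CLI does not cope well with the snappy-compressed shards,
+pyarrow is an alternative; a minimal sketch (the shard path argument is an
+example):
+
+    import sys
+    import pyarrow.parquet as pq
+
+    # usage: python peek.py <some part-*.snappy.parquet shard>
+    table = pq.read_table(sys.argv[1])
+    print(table.schema)                    # column names and types
+    print(table.slice(0, 5).to_pydict())   # peek at the first few rows
+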
+## Final Metadata Fields
+
+For the final BiblioRef object (a rough dataclass sketch follows these notes):
+
+ _key: ("wikipedia", source_wikipedia_article, ref_index)
+ source_wikipedia_article: Optional[str]
+ with lang prefix like "en:Superglue"
+ source_year: Optional[int]
+ current year? or article created?
+
+ ref_index: int
+ 1-indexed, not 0-indexed
+ ref_key: Optional[str]
+ eg, "Lee86", "BIB23"
+
+ match_provenance: wikipedia
+
+ get_unstructured: string (only if no release_ident link/match)
+ target_csl: free-form JSON (only if no release_ident link/match)
+ CSL-JSON schema (similar to ReleaseEntity schema, but not exactly)
+ generated from unstructured by a GROBID parse, if needed
+
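+A rough, illustrative dataclass mirroring the notes above; names and types
+come from these notes, not from the actual BiblioRef schema elsewhere in the
+project, which remains authoritative.
+
+    from dataclasses import dataclass
+    from typing import Any, Optional, Tuple
+
+    @dataclass
+    class WikipediaBiblioRef:
+        source_wikipedia_article: Optional[str] = None  # e.g. "en:Superglue"
+        source_year: Optional[int] = None
+        ref_index: int = 1                    # 1-indexed, not 0-indexed
+        ref_key: Optional[str] = None         # e.g. "Lee86", "BIB23"
+        match_provenance: str = "wikipedia"
+        get_unstructured: Optional[str] = None  # name as in the notes above;
+                                                # only if no release_ident match
+        target_csl: Optional[Any] = None        # CSL-JSON dict, only if no match
+
+        @property
+        def _key(self) -> Tuple[str, Optional[str], int]:
+            return ("wikipedia", self.source_wikipedia_article, self.ref_index)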