-rw-r--r--  python/notes/openlibrary_works.md    | 27
-rw-r--r--  python/notes/wikipedia_references.md | 85
2 files changed, 112 insertions, 0 deletions
diff --git a/python/notes/openlibrary_works.md b/python/notes/openlibrary_works.md
new file mode 100644
index 0000000..25df527
--- /dev/null
+++ b/python/notes/openlibrary_works.md
@@ -0,0 +1,27 @@

## Upstream Dumps

Open Library publishes monthly bulk dumps: <https://archive.org/details/ol_exports?sort=-publicdate>

Latest work dump: <https://openlibrary.org/data/ol_dump_works_latest.txt.gz>

TSV columns:

    type - type of record (/type/edition, /type/work, etc.)
    key - unique key of the record (/books/OL1M etc.)
    revision - revision number of the record
    last_modified - last modified timestamp
    JSON - the complete record in JSON format

Quick look at the JSON column:

    zcat ol_dump_works_latest.txt.gz | cut -f5 | head | jq .

We are going to want (see the parsing sketch below):

- title (with "prefix"?)
- authors
- subtitle
- year
- identifier (work? edition?)
- isbn-13 (if available)
- borrowable or not
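A minimal sketch of pulling those fields out of the work dump, assuming the
TSV layout above; the JSON field names used here (`title`, `subtitle`,
`authors`) are assumptions about the work-record schema, and year, ISBN-13,
and lending status actually hang off editions/availability, so they are left
as TODOs:

    # sketch only: stream the gzipped dump, split the TSV, parse the JSON column
    import gzip
    import json

    def iter_works(path="ol_dump_works_latest.txt.gz"):
        with gzip.open(path, "rt") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) != 5 or cols[0] != "/type/work":
                    continue
                record = json.loads(cols[4])
                yield {
                    "work_key": cols[1],  # eg "/works/OL123W"
                    "title": record.get("title"),
                    "subtitle": record.get("subtitle"),
                    # assumed shape: [{"author": {"key": "/authors/OL1A"}}, ...]
                    "author_keys": [a.get("author", {}).get("key")
                                    for a in record.get("authors", [])],
                    # TODO: year, isbn-13, borrowable -- these need edition
                    # and/or availability lookups, not just the work record
                }

    # print the first few extracted records
    for i, work in enumerate(iter_works()):
        print(json.dumps(work))
        if i >= 4:
            break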
+ +## "Wikipedia Citations" Dataset + + lookup_data.zip + Crossref API objects + single JSON file per DOI (many JSON files) + + minimal_dataset.zip + many parquet files (sharded), snappy-compressed + subset of citations_from_wikipedia.zip + + citations_from_wikipedia.zip + many parquet files (sharded), snappy-compressed + +Attempting to use `parquet-tools` pip packages (not the "official" +`parquet-tools` command) to dump out as... CSV? + + # in a virtualenv/pipenv + pip install python-snappy + pip install parquet + + # XXX + parquet --format json + +## Final Metadata Fields + +For the final BiblioRef object: + + _key: ("wikipedia", source_wikipedia_article, ref_index) + source_wikipedia_article: Optional[str] + with lang prefix like "en:Superglue" + source_year: Optional[int] + current year? or article created? + + ref_index: int + 1-indexed, not 0-indexed + ref_key: Optional[str] + eg, "Lee86", "BIB23" + + match_provenance: wikipedia + + get_unstructured: string (only if no release_ident link/match) + target_csl: free-form JSON (only if no release_ident link/match) + CSL-JSON schema (similar to ReleaseEntity schema, but not exactly) + generated from unstructured by a GROBID parse, if needed + |