![](static/6dSaW2q.png)

`refcat`: large-scale citation graph generation tools
=====================================================

An assembly of software tools in Python and Go, which together are used to
compile a citation graph with billions of edges (references) and hundreds of
millions of nodes (papers).

Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet Archive, as part of the [fatcat](https://fatcat.wiki) and [scholar.archive.org](https://scholar.archive.org) projects.

Code is organized into sub-modules, each with its own documentation:

* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
* [skate/](skate/README.md): various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks

The Python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.

A first version of the citation graph dataset was uploaded on Aug 7, 2021 to [https://archive.org/details/refcat_2021-07-28](https://archive.org/details/refcat_2021-07-28).

You can find additional information on the project in the [fatcat guide](https://guide.fatcat.wiki/reference_graph.html), a [blog post](https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/) and a [technical report](https://arxiv.org/abs/2110.06595).

## Overview

The high level goals of this project are:

* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library) and paper-to-webpage (Wayback Machine) links, as well as links to other datasets (e.g. Wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g.
  as a technical report)

The main challenges are:

* currently 2.5B reference documents (~1TB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which gives us about 5x space savings and a 2x speedup)
* partial metadata (requiring separate code paths)
* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since a good chunk of the metadata comes from ML-based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)

Internet Archive use cases for the output citation graph include:

* discovery tool, e.g. "cited by ..." link on [fatcat.wiki](https://fatcat.wiki/release/bza3ovudezahlexibdtoytgtb4/refs-in)
* look up things cited by a [wikipedia page](https://fatcat.wiki/wikipedia/en:Internet/refs-out), papers citing [books](https://fatcat.wiki/openlibrary/OL2141999W/refs-in), or papers referencing web pages (wip)
* metadata discovery; e.g. identify popularly cited works which are missing (i.e. have [no *matched*](https://git.archive.org/webgroup/refcat/-/blob/eb6dec279d66d35433f0ea7df1c1399896b111ce/python/refcat/tasks.py#L461-488) record in the catalog)
* Turn All References Blue (TARB, [notes](https://meta.wikimedia.org/wiki/GLAMTLV2018/Submissions/Turn_All_References_Blue!), [presentation](https://archive.org/details/mark-graham-presentation))

Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)

## Progress

We use informal, internal versioning for the graph: currently v3, next will be v4/v5.

Current status (version 2):

* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

Notes by iteration:

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)

## Support and Acknowledgements

Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).

Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).