refcat
: large-scale citation graph generation tools
An assembly of software tools in Python and Go, which together are used to compile a citation graph with billions of edges (references) and hundreds of millions of nodes (papers).
Maintained by martin@archive.org at the Internet Archive, as part of the fatcat and scholar.archive.org projects.
Code is organized into sub-modules, with their own documentation:
- python/: mostly luigi tasks (using shiv for single-file deployments)
- skate/: various Go command line tools (packaged as deb) for extracting keys, cleanup, join and serialization tasks
The python code also builds on top of the fuzzycat library.
A first version of the citation graph dataset has been uploaded on Aug 7, 2021 to https://archive.org/details/refcat_2021-07-28. You can find additional information on the project in the fatcat guide, blog post and in a technical report.
Overview
The high level goals of this project are:
- deriving a citation graph dataset from scholarly metadata
- beside paper-to-paper links the graph should also contain paper-to-book (open library) and paper-to-webpage (wayback machine) and other datasets (e.g. wikipedia)
- publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)
The main challenges are:
- currently 2.5B references documents (~1TB raw textual data); possibly going up to 2-4B (1-2TB raw textual data)
- currently a single machine setup (16 cores, 16T disk; note: we compress with zstd, which gives us about 5x space, 2x speedup)
- partial metadata (requiring separate code paths)
- data quality issues (e.g. need extra care to extract URLs, DOI, ISBN, etc. since a good chunk of the metadata comes from ML based PDF metadata extraction)
- fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
Internet Archive use cases for the output citation graph include:
- discovery tool, e.g. "cited by ..." link on fatcat.wiki
- lookup things cited by a wikipedia page, papers citing books or papers referencing web pages (wip)
- metadata discovery; e.g. identify popularly cited works which are missing (aka, have no matched record in the catalog)
- Turn All References Blue (TARB, notes, presentation)
Original design documents for this project are included in the fatcat git repository: Bulk Citation Graph (Oct 2020), Reference Graph API and Schema (Jan 2021)
Progress
We use informal, internal versioning for the graph currently v3, next will be v4/v5.
Current status (version 2):
- matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
- 785,569,011 edges (~103% of 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed
Notes by iteration:
- python/notes/version_0.md
- python/notes/version_1.md
- python/notes/version_2.md
- python/notes/version_3.md
Support and Acknowledgements
Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the "Ensuring the Persistent Access of Open Access Journal Literature" project (see original announcement).
Additional acknowledgements at fatcat.wiki.