![](https://i.imgur.com/6dSaW2q.png)

`cgraph`: large-scale citation graph generation tools
=====================================================

A collection of software tools in Python and Go, used together to compile a citation graph with billions of edges (references) and hundreds of millions of nodes (papers). Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet Archive, as part of the [fatcat](https://fatcat.wiki) and [scholar.archive.org](https://scholar.archive.org) projects.

Code is organized into sub-modules, each with its own documentation:

* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
* [skate/](skate/README.md): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks

The Python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.

As of June 2021, the citation graph has not yet been published, but a public release is expected soon.

## Overview

The high level goals of this project are:

* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library) and paper-to-webpage (Wayback Machine) links, as well as links into other datasets (e.g. Wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)

The main challenges are:

* currently 1.8B reference documents (~800GB of raw textual data); possibly growing to 2-4B (1-2TB of raw textual data)
* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which effectively gives us about 5x the space)
* partial metadata (requiring separate code paths)
* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML-based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute); see the sketch after this list
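To make the fuzzy matching challenge more concrete, here is a minimal Go sketch of title-based key extraction, the kind of map-style task the skate tools perform over newline-delimited JSON. The `titleKey` function and its normalization rules are illustrative assumptions, not skate's actual implementation:

```go
// Sketch of title-based key extraction for clustering. Reads one JSON
// metadata document per line from stdin, emits "key\tdoc" pairs, so that
// near-duplicate titles end up under the same key.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"unicode"
)

// titleKey lowercases a title and strips everything but letters and
// digits; an assumed normalization, not skate's actual rule set.
func titleKey(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Raise the maximum line size; reference docs can exceed the
	// default 64KB scanner buffer.
	scanner.Buffer(make([]byte, 0, 16*1024*1024), 16*1024*1024)
	for scanner.Scan() {
		line := scanner.Bytes()
		var doc struct {
			Title string `json:"title"`
		}
		if err := json.Unmarshal(line, &doc); err != nil || doc.Title == "" {
			continue // skip documents without a usable title
		}
		fmt.Printf("%s\t%s\n", titleKey(doc.Title), line)
	}
}
```

Emitting key-prefixed lines means candidate clusters can be formed with a plain external sort on the first column (e.g. GNU sort), and verification then only needs to compare documents within each group, which is what makes this workflow feasible on a single machine.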
Internet Archive use cases for the resulting citation graph include:

* a discovery tool, e.g. a "cited by ..." link for scholar.archive.org
* lookup of works citing a given page/book/website/...
* metadata discovery, e.g. identifying popularly cited works which are missing (i.e. have no "matched" record in the catalog)
* Turn All References Blue (TARB)

Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)

## Progress

We use informal, internal versioning for the graph: currently v3, next will be v4/v5.

Status as of version 2:

* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of the 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

Notes by iteration:

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)

## Support and Acknowledgements

Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see the [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)). Additional acknowledgements are [at fatcat.wiki](https://fatcat.wiki/about).