![](https://i.imgur.com/6dSaW2q.png)

`cgraph`: large-scale citation graph generation tools
=====================================================

A collection of software tools in Python and Go, used together to compile a citation graph with billions of edges (references) and hundreds of millions of nodes (papers). Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet Archive, as part of the [fatcat](https://fatcat.wiki) and [scholar.archive.org](https://scholar.archive.org) projects.

Code is organized into sub-modules, each with its own documentation:

* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using [shiv](https://github.com/linkedin/shiv) for single-file deployments)
* [skate/](skate/README.md): various Go command line tools (packaged as deb) for key extraction, cleanup, join, and serialization tasks

The Python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.

As of June 2021, the citation graph has not yet been published, but a public release is expected soon.

## Overview

The high level goals of this project are:

* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library) and paper-to-webpage (Wayback Machine) links, as well as links into other datasets (e.g. Wikipedia)
* publication of this dataset in a suitable format, alongside a description of its content (e.g. as a technical report)

The main challenges are:

* currently 1.8B reference documents (~800GB of raw textual data); possibly growing to 2-4B (1-2TB of raw textual data)
* currently a single machine setup (16 cores, 16T disk; note: we compress with [zstd](https://github.com/facebook/zstd), which effectively gives us about 5x the space)
* partial metadata (requiring separate code paths)
* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc., since about 800M metadata docs come from ML-based [PDF metadata extraction](https://grobid.readthedocs.io))
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute); see the sketch after this list
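To make the fuzzy matching challenge more concrete, here is a minimal Go sketch of title-based key extraction, the kind of map-style task the skate tools perform over newline-delimited JSON. The `titleKey` function and its normalization rules are illustrative assumptions, not skate's actual implementation:

```go
// Sketch of title-based key extraction for clustering. Reads one JSON
// metadata document per line from stdin, emits "key\tdoc" pairs, so that
// near-duplicate titles end up under the same key.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"strings"
	"unicode"
)

// titleKey lowercases a title and strips everything but letters and
// digits; an assumed normalization, not skate's actual rule set.
func titleKey(title string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(title) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Raise the maximum line size; reference docs can exceed the
	// default 64KB scanner buffer.
	scanner.Buffer(make([]byte, 0, 16*1024*1024), 16*1024*1024)
	for scanner.Scan() {
		line := scanner.Bytes()
		var doc struct {
			Title string `json:"title"`
		}
		if err := json.Unmarshal(line, &doc); err != nil || doc.Title == "" {
			continue // skip documents without a usable title
		}
		fmt.Printf("%s\t%s\n", titleKey(doc.Title), line)
	}
}
```

Emitting key-prefixed lines means candidate clusters can be formed with a plain external sort on the first column (e.g. GNU sort), and verification then only needs to compare documents within each group, which is what makes this workflow feasible on a single machine.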
Internet Archive use cases for the resulting citation graph include:

* a discovery tool, e.g. a "cited by ..." link for scholar.archive.org
* lookup of works citing a given page/book/website/...
* metadata discovery, e.g. identifying popularly cited works which are missing (i.e. have no "matched" record in the catalog)
* Turn All References Blue (TARB)

Original design documents for this project are included in the fatcat git repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md), [Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md)

## Progress

We use informal, internal versioning for the graph: currently v3, next will be v4/v5.

Status as of version 2:

* matches via: doi, arxiv, pmid, pmcid, fuzzy title matches
* 785,569,011 edges (~103% of the 12/2020 OCI/crossref release), ~39G compressed, ~288G uncompressed

Notes by iteration:

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)

## Support and Acknowledgements

Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see the [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)). Additional acknowledgements are [at fatcat.wiki](https://fatcat.wiki/about).