# Building a Citation Graph

* date: 2021-04-23
* status: implemented in [refcat](https://gitlab.com/internetarchive/refcat)

## Problem and Goal

We want to generate a citation graph including bibliographic data from fatcat,
open library, wikipedia and other sources; we also want to include archived web
pages referenced in papers.

Citation indices and graphs can be traced back at least to the seminal paper
*Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
published in 2005 already lists 17 services that include cited reference
search. Citation counts are common elements on scholarly search engine sites.

We are working with two main document types: a catalog record and an entry
describing a citation. Both may contain only partial information.

## A Funnel Approach

To link a reference entry to a catalog record we use a funnel approach. That
is, we start with the most common (or the easiest) pattern in the data, then
iterate and look at harder or more obscure patterns.

The simplest and most reliable way of linking is by persistent identifier (PID)
or per-source unique identifier (such as a PubMed ID). If no identifier is
available, we fall back to a fuzzy matching and verification approach that
implements data-specific rules for matching.
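
The funnel can be sketched as a sequence of stages that are tried in order.
The following is a minimal illustration and not the refcat implementation: the
field names (`doi`, `pmid`, `title`, `year`), the title normalization and the
year-based verification rule are all assumptions chosen for the example.

```python
import re

# Hypothetical identifier fields, tried in order; not refcat's actual schema.
ID_FIELDS = ("doi", "pmid")


def normalize_title(title):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def build_indexes(catalog):
    """Index catalog records by each identifier and by normalized title."""
    indexes = {field: {} for field in ID_FIELDS}
    indexes["title"] = {}
    for record in catalog:
        for field in ID_FIELDS:
            if record.get(field):
                indexes[field][record[field]] = record
        if record.get("title"):
            key = normalize_title(record["title"])
            indexes["title"].setdefault(key, []).append(record)
    return indexes


def verify(ref, record):
    """Toy verification rule (assumed): years must agree when both are set."""
    if ref.get("year") and record.get("year"):
        return ref["year"] == record["year"]
    return True


def link(ref, indexes):
    """Return (record, stage) for the first matching stage, else (None, None)."""
    # Stage 1: exact identifier match, the easiest and most reliable pattern.
    for field in ID_FIELDS:
        value = ref.get(field)
        if value and value in indexes[field]:
            return indexes[field][value], field
    # Stage 2: fuzzy candidates via normalized title, then verification.
    title = ref.get("title")
    if title:
        for candidate in indexes["title"].get(normalize_title(title), []):
            if verify(ref, candidate):
                return candidate, "fuzzy"
    return None, None
```

A reference carrying a DOI is resolved in the first stage; one with only a
title and year falls through to the fuzzy stage and is accepted only if the
verification rule agrees.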

## Implementation

A goal is to start small, and eventually move to a canonical data framework for
processing, if appropriate or necessary [3].

In particular, we would like to be able to analyze a few billion reference
entries in a reasonable amount of time with little setup and minimal resource
dependencies.

We use a processing model similar to *map-reduce*. Specifically, we derive a
key from each document and pass (key, document) tuples sharing a key to a
reduce function, which performs additional computation, such as verification or
reference schema generation (e.g. a JSON document representing an edge in the
citation graph).

This approach allows us to work with exact identifiers, as well as fuzzy
matching over partial data.
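
As a rough sketch of this processing model, the following groups documents by
a derived key and hands each group to a reduce function that emits JSON edges.
The document fields (`kind`, `id`, `source_id`, `doi`), the helper names and
the edge layout are hypothetical; the in-memory sort is only for illustration
and would presumably be replaced by an on-disk sort at the input sizes
mentioned in [3].

```python
import itertools
import json


def map_reduce(docs, key_func, reduce_func):
    """Derive a key per document, group by key, and reduce each group."""
    keyed = [(key_func(doc), doc) for doc in docs]
    # Documents without a usable key cannot be grouped at this stage.
    keyed = [(key, doc) for key, doc in keyed if key is not None]
    keyed.sort(key=lambda pair: pair[0])
    for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        yield from reduce_func(key, [doc for _, doc in group])


def doi_key(doc):
    """Exact-identifier key (assumed field name): the lowercased DOI."""
    doi = doc.get("doi")
    return doi.lower() if doi else None


def emit_edges(key, docs):
    """Pair each reference entry with each catalog record sharing the key."""
    records = [d for d in docs if d["kind"] == "record"]
    refs = [d for d in docs if d["kind"] == "ref"]
    for ref in refs:
        for record in records:
            edge = {"source": ref["source_id"], "target": record["id"], "match": key}
            yield json.dumps(edge, sort_keys=True)
```

Swapping `doi_key` for a function that returns, say, a normalized title would
turn the same pipeline into a fuzzy matching stage, with verification done
inside the reduce function.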

----

# Refs

* [1] http://garfield.library.upenn.edu/papers/science1955.pdf
* [2] https://authors.library.caltech.edu/24838/1/ROTcs05.pdf
* [3] As of 04/2021 the total input size is about 1.6TB of uncompressed JSON documents.