# Building a Citation Graph

* date: 2021-04-23
* status: implemented in [refcat](https://gitlab.com/internetarchive/refcat)

## Problem and Goal

We want to generate a citation graph including bibliographic data from fatcat, Open Library, Wikipedia and other sources; we also want to include archived web pages referenced in papers.

Citation indexes and graphs can be traced back at least to the seminal paper *Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2] published in 2005 already lists 17 services that include cited reference search. Citation counts are common elements on scholarly search engine sites.

We are working with two main document types: a catalog record and an entry describing a citation. Both may contain only partial information.

## A Funnel Approach

To link a reference entry to a catalog record we use a funnel approach. That is, we start with the most common (or the easiest) pattern in the data, then iterate and look at harder or more obscure patterns.

The simplest and most reliable way of linking is by persistent identifier (PID) or per-source unique identifier (such as PubMed ID). If no identifier is available, we fall back to a fuzzy matching and verification approach that implements data-specific rules for matching.

## Implementation

A goal is to start small, and eventually move to a canonical data framework for processing, if appropriate or necessary [3]. In particular, we would like to analyze a few billion reference entries in a reasonable amount of time with little setup and minimal resource dependencies.

We use a *map-reduce*-like processing model. Specifically, we derive a key from a document and pass (key, document) tuples sharing a key to a reduce function, which performs additional computation, such as verification or reference schema generation (e.g. a JSON document representing an edge in the citation graph).
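
The funnel and the key-based grouping described above can be illustrated with a minimal in-memory sketch. It is not the refcat implementation: the field names (`type`, `id`, `source_id`, `doi`, `pmid`, `title`, `year`), the title normalization, and the year-equality verification rule are all assumptions chosen for illustration, and the identifiers in the example data are made up. A real run over billions of entries would need external sorting rather than an in-memory list.

```python
import itertools
import json
import re


def derive_keys(doc):
    """Map step: yield every (kind, key) a document can be grouped under."""
    # Exact identifiers come first in the funnel (field names are assumptions).
    for field in ("doi", "pmid"):
        value = doc.get(field)
        if value:
            yield ("exact", f"{field}:{str(value).strip().lower()}")
    # Fallback: a crude normalized title as a fuzzy key.
    title = re.sub(r"[^a-z0-9]", "", (doc.get("title") or "").lower())
    if title:
        yield ("fuzzy", f"title:{title}")


def verify(ref, record, kind):
    """Accept exact matches; for fuzzy matches require equal years (hypothetical rule)."""
    if kind == "exact":
        return True
    return ref.get("year") is not None and ref.get("year") == record.get("year")


def reduce_group(kind, docs):
    """Reduce step: pair references with catalog records sharing a key, emit verified edges."""
    records = [d for d in docs if d["type"] == "record"]
    refs = [d for d in docs if d["type"] == "ref"]
    for ref in refs:
        for record in records:
            if verify(ref, record, kind):
                yield {"source": ref["source_id"], "target": record["id"], "match": kind}


def build_edges(docs):
    """Group (key, document) tuples by key and collect deduplicated edges."""
    keyed = [(key, doc) for doc in docs for key in derive_keys(doc)]
    keyed.sort(key=lambda pair: pair[0])  # "exact" keys sort before "fuzzy" keys
    seen, edges = set(), []
    for (kind, _), group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        for edge in reduce_group(kind, [doc for _, doc in group]):
            pair = (edge["source"], edge["target"])
            if pair not in seen:  # keep the first (most reliable) match only
                seen.add(pair)
                edges.append(edge)
    return edges


# Made-up example data: one catalog record and three reference entries.
docs = [
    {"type": "record", "id": "rec-1", "doi": "10.1000/abc",
     "title": "Citation Indexes for Science", "year": 1955},
    {"type": "ref", "source_id": "rec-2", "doi": "10.1000/ABC"},
    {"type": "ref", "source_id": "rec-3", "title": "Citation indexes for science.", "year": 1955},
    {"type": "ref", "source_id": "rec-4", "title": "Citation indexes for science", "year": 1999},
]
for edge in build_edges(docs):
    print(json.dumps(edge))
```

In this sketch each document is emitted once per available key, so a reference without an identifier can still meet a catalog record under the shared title key. Because exact-identifier groups are processed first, a pair already linked by identifier is not linked again by the fuzzy stage.
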
This approach allows us to work with exact identifiers, as well as fuzzy matching over partial data.

----

# Refs

* [1] http://garfield.library.upenn.edu/papers/science1955.pdf
* [2] https://authors.library.caltech.edu/24838/1/ROTcs05.pdf
* [3] As of 04/2021 the total input size is about 1.6TB uncompressed JSON documents.