![](static/6dSaW2q.png)

`refcat`: large-scale citation graph generation tools
=====================================================

A collection of software tools written in Python and Go that together compile a
citation graph with billions of edges (references) and hundreds of millions of
nodes (papers).

Maintained by [martin@archive.org](mailto:martin@archive.org) at the Internet
Archive, as part of the [fatcat](https://fatcat.wiki) and
[scholar.archive.org](https://scholar.archive.org) projects.

Code is organized into sub-modules, each with its own documentation:

* [python/](python/README.md): mostly [luigi](https://github.com/spotify/luigi) tasks (using
  [shiv](https://github.com/linkedin/shiv) for single-file deployments); a minimal task sketch follows below
* [skate/](skate/README.md): various Go command line tools (packaged as a deb) for key extraction, cleanup, join, and serialization tasks

The Python code also builds on top of the [fuzzycat](https://pypi.org/project/fuzzycat/) library.
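
For orientation, here is a minimal sketch of what a task under `python/` could
look like. The task name, field names, and output path are hypothetical; the
sketch only illustrates the luigi `output()`/`run()` pattern, not an actual
refcat task.

```python
# Hypothetical luigi task in the style used under python/; names and paths
# are illustrative only, not part of the actual refcat pipeline.
import json

import luigi


class ExtractReleaseKeys(luigi.Task):
    """Derive (ident, normalized title) lines from a JSON lines metadata dump."""

    input_path = luigi.Parameter(description="path to a JSON lines metadata dump")

    def output(self):
        # Each task writes a single target, which downstream tasks can require().
        return luigi.LocalTarget("output/release_keys.tsv")

    def run(self):
        with open(self.input_path) as f, self.output().open("w") as out:
            for line in f:
                doc = json.loads(line)
                ident = doc.get("ident")
                title = (doc.get("title") or "").strip().lower()
                if ident and title:
                    out.write(f"{ident}\t{title}\n")


if __name__ == "__main__":
    # e.g. python tasks.py ExtractReleaseKeys --input-path dump.json --local-scheduler
    luigi.run()
```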

A first version of the citation graph dataset was uploaded on Aug 7, 2021 to
[https://archive.org/details/refcat_2021-07-28](https://archive.org/details/refcat_2021-07-28).
You can find additional information on the project in the [fatcat
guide](https://guide.fatcat.wiki/reference_graph.html), a [blog
post](https://blog.archive.org/2021/10/19/internet-archive-releases-refcat-the-ia-scholar-index-of-over-1-3-billion-scholarly-citations/),
and a [technical report](https://arxiv.org/abs/2110.06595).


## Overview

The high-level goals of this project are:

* deriving a [citation graph](https://en.wikipedia.org/wiki/Citation_graph) dataset from scholarly metadata
* besides paper-to-paper links, the graph should also contain paper-to-book (Open Library) and paper-to-webpage (Wayback Machine) links, as well as links to other datasets (e.g. Wikipedia)
* publishing this dataset in a suitable format, alongside a description of its contents (e.g. as a technical report)


The main challenges are:

* currently 2.5B reference documents (~1TB of raw textual data); possibly growing to 2-4B (1-2TB of raw textual data)
* currently a single-machine setup (16 cores, 16T disk; note: we compress with
  [zstd](https://github.com/facebook/zstd), which gives us about 5x space savings and a 2x
  speedup)
* partial metadata (requiring separate code paths)
* data quality issues (e.g. extra care is needed to extract URLs, DOIs, ISBNs, etc.,
  since a good chunk of the metadata comes from ML-based [PDF metadata
  extraction](https://grobid.readthedocs.io)); see the sketch after this list
* fuzzy matching and verification at scale (e.g. verifying 1M clustered documents per minute)
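
To illustrate the data quality point above, the sketch below shows the kind of
defensive identifier cleanup that noisy, ML-extracted reference metadata
requires. The regular expression and normalization rules are simplified
assumptions, not the exact rules used in refcat or skate.

```python
# Illustrative DOI cleanup for noisy reference strings; the regex and
# normalization rules are simplified and not the exact rules used in refcat.
import re

DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def clean_doi(raw):
    """Try to pull a plausible DOI out of a noisy field, or return None."""
    if not raw:
        return None
    match = DOI_PATTERN.search(raw)
    if not match:
        return None
    # Lowercase and strip trailing punctuation that PDF extraction tends to attach.
    return match.group(0).lower().rstrip(".,;)]}")


assert clean_doi("https://doi.org/10.1000/XYZ123.") == "10.1000/xyz123"
assert clean_doi("no identifier here") is None
```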


Internet Archive use cases for the output citation graph include:

* a discovery tool, e.g. a "cited by ..." link on [fatcat.wiki](https://fatcat.wiki/release/bza3ovudezahlexibdtoytgtb4/refs-in)
* looking up items cited by a [Wikipedia page](https://fatcat.wiki/wikipedia/en:Internet/refs-out), papers citing [books](https://fatcat.wiki/openlibrary/OL2141999W/refs-in), or papers referencing web pages (work in progress)
* metadata discovery, e.g. identifying frequently cited works which are missing
  (i.e. have [no *matched*](https://git.archive.org/webgroup/refcat/-/blob/eb6dec279d66d35433f0ea7df1c1399896b111ce/python/refcat/tasks.py#L461-488)
  record in the catalog)
* Turn All References Blue (TARB, [notes](https://meta.wikimedia.org/wiki/GLAMTLV2018/Submissions/Turn_All_References_Blue!), [presentation](https://archive.org/details/mark-graham-presentation))

Original design documents for this project are included in the fatcat git
repository: [Bulk Citation Graph (Oct 2020)](https://github.com/internetarchive/fatcat/blob/master/proposals/202008_bulk_citation_graph.md) and
[Reference Graph API and Schema (Jan 2021)](https://github.com/internetarchive/fatcat/blob/master/proposals/2021-01-29_citation_api.md).

## Progress

We use informal, internal versioning for the graph; the current version is v3, and the next will be v4/v5.

Status as of version 2:

* matches via DOI, arXiv id, PMID, PMCID, and fuzzy title matching
* 785,569,011 edges (~103% of the 12/2020 OCI/Crossref release), ~39G compressed, ~288G uncompressed

Notes by iteration:

* [python/notes/version_0.md](python/notes/version_0.md)
* [python/notes/version_1.md](python/notes/version_1.md)
* [python/notes/version_2.md](python/notes/version_2.md)
* [python/notes/version_3.md](python/notes/version_3.md)

## Support and Acknowledgements

Work on this software received support from the Andrew W. Mellon Foundation through multiple phases of the ["Ensuring the Persistent Access of Open Access Journal Literature"](https://mellon.org/grants/grants-database/advanced-search/?amount-low=&amount-high=&year-start=&year-end=&city=&state=&country=&q=%22Ensuring+the+Persistent+Access%22&per_page=25) project (see [original announcement](http://blog.archive.org/2018/03/05/andrew-w-mellon-foundation-awards-grant-to-the-internet-archive-for-long-tail-journal-preservation/)).

Additional acknowledgements [at fatcat.wiki](https://fatcat.wiki/about).