guide/src/reference_graph.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29


# Reference Graph

Since 08/2021 references are available on an "inbound" and "outbound" basis in
the web interface.

The backend reference graph is available via the [Search API](./search_api.md)
under the `fatcat_ref` index.

## Background and Mode of Operation

Release entities in fatcat have a [refs fields](./entity_release.md) which
contains citations, which in turn may be identified in different ways. Another
source of reference metadata is provided by structured data extraction from PDF
with tools such as [GROBID](https://grobid.readthedocs.io). The raw reference data combined
amounts to over 2B documents which we take as input for a batch process, that
derives the graph structure.

Two main modes of citation matching are employed: identifier based matching and
fuzzy matching. Identifier based matching currently works with DOI, Arxiv ids,
PMID and PMCID and ISBN. Fuzzy matching employs a scalable way to cluster
documents (with pluggable clustering algorithms). For each cluster of match
candidates we run a more extensive verification process, which yields a match
confidence category, ranging from weak over strong to exact. Strong and exact
matches are included in the graph.

The current reference search index contains both matches and yet unmatched
references. We expect this dataset to be iterated over regularly as there are
a few dimensions along which the dataset can be improved and extended.