From c66a240ddcc6127bff25cd734b89aa2efa097cc8 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Fri, 6 Aug 2021 22:52:40 +0200
Subject: guide: draft notes on background and mode of operatio

---
 guide/src/reference_graph.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/guide/src/reference_graph.md b/guide/src/reference_graph.md
index 3b773150..a38a017b 100644
--- a/guide/src/reference_graph.md
+++ b/guide/src/reference_graph.md
@@ -1,9 +1,29 @@
 # Reference Graph
 
-As a new feature, fuzzy-matched references are available on an "inbound" and
-"outbound" basis in the web interface.
+Since 08/2021 references are available on an "inbound" and "outbound" basis in
+the web interface.
 
 The backend reference graph is available via the [Search API](./search_api.md)
 under the `fatcat_ref` index.
 
 
+## Background and Mode of Operation
+
+Release entities in fatcat have a [refs fields](./entity_release.md) which
+contains citations, which in turn may be identified in different ways. Another
+source of reference metadata is provided by structured data extraction from PDF
+with tools such as [GROBID](https://grobid.readthedocs.io). The raw reference data combined
+amounts to over 2B documents which we take as input for a batch process, that
+derives the graph structure.
+
+Two main modes of citation matching are employed: identifier based matching and
+fuzzy matching. Identifier based matching currently works with DOI, Arxiv ids,
+PMID and PMCID and ISBN. Fuzzy matching employs a scalable way to cluster
+documents (with pluggable clustering algorithms). For each cluster of match
+candidates we run a more extensive verification process, which yields a match
+confidence category, ranging from weak over strong to exact. Strong and exact
+matches are included in the graph.
+
+The current reference search index contains both matches and yet unmatched
+references. We expect this dataset to be iterated over regularly as there are
+a few dimensions along which the dataset can be improved and extended.
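
Note on the added "Background and Mode of Operation" section: the identifier based matching mode and the confidence filter it describes could be sketched roughly as below. This is a minimal illustration and not the project's implementation. The record layout (plain dicts with `ident`, `doi`, `arxiv`, `pmid`, `pmcid`, `isbn` keys), the lowercase-and-strip normalization, and all function names are assumptions made for the example.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Match confidence categories named in the guide, ordered weak < strong < exact."""
    WEAK = 1
    STRONG = 2
    EXACT = 3


# Identifier types listed in the guide; the dict keys are hypothetical.
ID_FIELDS = ("doi", "arxiv", "pmid", "pmcid", "isbn")


def normalize(value):
    """Assumed normalization: trim whitespace and lowercase."""
    return value.strip().lower()


def build_index(releases):
    """Map (identifier type, normalized value) to a release ident."""
    index = {}
    for release in releases:
        for field in ID_FIELDS:
            value = release.get(field)
            if value:
                index[(field, normalize(value))] = release["ident"]
    return index


def match_by_identifier(ref, index):
    """Return the ident of the first release sharing an identifier with ref, else None."""
    for field in ID_FIELDS:
        value = ref.get(field)
        if value:
            target = index.get((field, normalize(value)))
            if target is not None:
                return target
    return None


def keep_for_graph(confidence):
    """Only strong and exact matches are included in the graph."""
    return confidence >= Confidence.STRONG
```

References without any usable identifier return `None` here; in the process described by the patch those would be candidates for the fuzzy matching path, which this sketch does not cover.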