Merge branch 'martin-guide-ref' into 'master'

guide: draft notes on background and mode of operatio

See merge request webgroup/fatcat!114
author: bnewbold <bnewbold@archive.org> 2021-08-06 21:02:11 +0000
committer: bnewbold <bnewbold@archive.org> 2021-08-06 21:02:11 +0000
commit: 0a28fa4ecb2dfc23ad6a21c3d81657ff9f2a62a5 (patch)
tree: 57418db1f79ff0435e9b31ad15c18aa362dd8d4e /guide
parent: cf5a03d8803d660b69f781827c8bc0828ae7ca13 (diff)
parent: c66a240ddcc6127bff25cd734b89aa2efa097cc8 (diff)
1 files changed, 22 insertions, 2 deletions
diff --git a/guide/src/reference_graph.md b/guide/src/reference_graph.md
index 3b773150..a38a017b 100644
--- a/guide/src/reference_graph.md
+++ b/guide/src/reference_graph.md
@@ -1,9 +1,29 @@
 
 # Reference Graph
 
-As a new feature, fuzzy-matched references are available on an "inbound" and
-"outbound" basis in the web interface.
+Since 08/2021 references are available on an "inbound" and "outbound" basis in
+the web interface.
 
 The backend reference graph is available via the [Search API](./search_api.md)
 under the `fatcat_ref` index.
 
+## Background and Mode of Operation
+
+Release entities in fatcat have a [refs field](./entity_release.md) that
+contains citations, which in turn may be identified in different ways. Another
+source of reference metadata is structured data extraction from PDFs
+with tools such as [GROBID](https://grobid.readthedocs.io). The combined raw reference data
+amounts to over 2B documents, which we take as input for a batch process that
+derives the graph structure.
+
+Two main modes of citation matching are employed: identifier-based matching and
+fuzzy matching. Identifier-based matching currently works with DOI, arXiv ID,
+PMID, PMCID, and ISBN. Fuzzy matching employs a scalable way to cluster
+documents (with pluggable clustering algorithms). For each cluster of match
+candidates we run a more extensive verification process, which yields a match
+confidence category, ranging from weak through strong to exact. Strong and exact
+matches are included in the graph.
+
+The current reference search index contains both matched and as yet unmatched
+references. We expect this dataset to be iterated on regularly, as there are
+a few dimensions along which the dataset can be improved and extended.
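
The two matching modes described in the added section can be illustrated with a minimal sketch. It is not the fatcat batch pipeline: the record layout, the field names (`ident`, `source`, `arxiv_id`, and so on), the title-based cluster key, and the toy `verify` rules are all assumptions made for illustration. Only the overall flow (identifier match first, then cluster, verify, and keep strong and exact matches) follows the text.

```python
import re
from collections import defaultdict

# Identifier types named in the guide; the dict keys are hypothetical.
ID_FIELDS = ("doi", "arxiv_id", "pmid", "pmcid", "isbn")
ACCEPTED = {"strong", "exact"}


def title_key(title):
    """Hypothetical cluster key: lowercase title, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", (title or "").lower())


def verify(ref, release):
    """Toy verification step returning 'weak', 'strong', or 'exact'."""
    year = ref.get("year")
    if year is None or year != release.get("year"):
        return "weak"
    if ref.get("title") == release.get("title"):
        return "exact"
    return "strong"


def match_refs(refs, releases):
    """Return (edges, unmatched); an edge is (source, target, how)."""
    by_id = {}
    clusters = defaultdict(list)
    for rel in releases:
        for field in ID_FIELDS:
            if rel.get(field):
                by_id[(field, str(rel[field]).lower())] = rel
        key = title_key(rel.get("title"))
        if key:
            clusters[key].append(rel)

    edges, unmatched = [], []
    for ref in refs:
        # Mode 1: identifier-based matching.
        target = None
        for field in ID_FIELDS:
            if ref.get(field):
                target = by_id.get((field, str(ref[field]).lower()))
                if target is not None:
                    break
        if target is not None:
            edges.append((ref["source"], target["ident"], "identifier"))
            continue

        # Mode 2: fuzzy matching, i.e. cluster lookup plus verification.
        key = title_key(ref.get("title"))
        candidates = clusters.get(key, []) if key else []
        for cand in candidates:
            status = verify(ref, cand)
            if status in ACCEPTED:
                edges.append((ref["source"], cand["ident"], status))
                break
        else:
            # Weak or missing matches stay in the data as unmatched refs.
            unmatched.append(ref)
    return edges, unmatched
```

Keeping the unmatched references alongside the edges mirrors the note above that the search index contains both matched and as yet unmatched references.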