author Martin Czygan <martin.czygan@gmail.com> 2021-08-06 22:52:40 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-08-06 22:52:40 +0200
commit c66a240ddcc6127bff25cd734b89aa2efa097cc8
tree 57418db1f79ff0435e9b31ad15c18aa362dd8d4e /guide
parent cf5a03d8803d660b69f781827c8bc0828ae7ca13
guide: draft notes on background and mode of operation
Diffstat (limited to 'guide')
-rw-r--r-- guide/src/reference_graph.md 24
1 files changed, 22 insertions, 2 deletions
diff --git a/guide/src/reference_graph.md b/guide/src/reference_graph.md
index 3b773150..a38a017b 100644
--- a/guide/src/reference_graph.md
+++ b/guide/src/reference_graph.md
@@ -1,9 +1,29 @@
# Reference Graph
-As a new feature, fuzzy-matched references are available on an "inbound" and
-"outbound" basis in the web interface.
+Since August 2021, references have been available on an "inbound" and
+"outbound" basis in the web interface.
The backend reference graph is available via the [Search API](./search_api.md)
under the `fatcat_ref` index.
+## Background and Mode of Operation
+
+Release entities in fatcat have a [`refs` field](./entity_release.md) that
+contains citations, which in turn may be identified in different ways. Another
+source of reference metadata is structured data extracted from PDFs
+with tools such as [GROBID](https://grobid.readthedocs.io). The combined raw reference data
+amounts to over 2B documents, which we take as input for a batch process that
+derives the graph structure.
+
+Two main modes of citation matching are employed: identifier-based matching and
+fuzzy matching. Identifier-based matching currently works with DOI, arXiv ID,
+PMID, PMCID, and ISBN. Fuzzy matching employs a scalable way to cluster
+documents (with pluggable clustering algorithms). For each cluster of match
+candidates we run a more extensive verification process, which yields a match
+confidence category, ranging from weak through strong to exact. Strong and exact
+matches are included in the graph.
+
+The current reference search index contains both matched and as yet unmatched
+references. We expect to iterate on this dataset regularly, as there are
+several dimensions along which it can be improved and extended.
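
Note (not part of the patch): the two matching modes described in the added section can be illustrated with a small sketch. This is a toy under stated assumptions, not the fatcat implementation: the record layout (plain dicts with `id`, `title`, `year` and identifier keys), the `cluster_key` normalization, the rules inside `verify`, and the `"identifier"` label for identifier-based edges are all hypothetical. Only the overall shape (identifier matching first, then clustering plus verification, keeping strong and exact matches) follows the text.

```python
import re
from collections import defaultdict

# Identifier types named in the guide text; the dict keys are assumptions.
ID_FIELDS = ("doi", "arxiv", "pmid", "pmcid", "isbn")


def identifier_match(ref, release):
    """Return True if both records share any non-empty identifier."""
    return any(ref.get(k) and ref.get(k) == release.get(k) for k in ID_FIELDS)


def cluster_key(doc):
    """Hypothetical clustering key: lowercased title, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def verify(ref, release):
    """Toy verification step yielding 'exact', 'strong' or 'weak'."""
    same_year = ref.get("year") == release.get("year")
    if same_year and ref.get("title") == release.get("title"):
        return "exact"
    if same_year:
        return "strong"
    return "weak"


def match(refs, releases):
    """Return (ref_id, release_id, status) edges for the reference graph."""
    # Group releases by clustering key so candidates are found per cluster.
    clusters = defaultdict(list)
    for rel in releases:
        clusters[cluster_key(rel)].append(rel)

    edges = []
    for ref in refs:
        # Mode 1: identifier-based matching.
        hits = [rel for rel in releases if identifier_match(ref, rel)]
        if hits:
            edges.extend((ref["id"], rel["id"], "identifier") for rel in hits)
            continue
        # Mode 2: fuzzy matching, i.e. cluster lookup followed by verification.
        key = cluster_key(ref)
        if not key:
            continue
        for rel in clusters.get(key, []):
            status = verify(ref, rel)
            if status in ("strong", "exact"):  # weak matches are dropped
                edges.append((ref["id"], rel["id"], status))
    return edges
```

Swapping `cluster_key` for another function mirrors the "pluggable clustering algorithms" mentioned above; references that produce no edge correspond to the unmatched entries that remain in the index.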