field | value | date |
---|---|---|
author | Martin Czygan <martin.czygan@gmail.com> | 2021-08-06 22:52:40 +0200 |
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-08-06 22:52:40 +0200 |
commit | c66a240ddcc6127bff25cd734b89aa2efa097cc8 | |
tree | 57418db1f79ff0435e9b31ad15c18aa362dd8d4e | |
parent | cf5a03d8803d660b69f781827c8bc0828ae7ca13 | |
guide: draft notes on background and mode of operation
-rw-r--r-- | guide/src/reference_graph.md | 24 |
1 files changed, 22 insertions, 2 deletions
diff --git a/guide/src/reference_graph.md b/guide/src/reference_graph.md
index 3b773150..a38a017b 100644
--- a/guide/src/reference_graph.md
+++ b/guide/src/reference_graph.md
@@ -1,9 +1,29 @@
 # Reference Graph

-As a new feature, fuzzy-matched references are available on an "inbound" and
-"outbound" basis in the web interface.
+Since 08/2021 references are available on an "inbound" and "outbound" basis in
+the web interface.

 The backend reference graph is available via the [Search API](./search_api.md)
 under the `fatcat_ref` index.

+## Background and Mode of Operation
+
+Release entities in fatcat have a [refs fields](./entity_release.md) which
+contains citations, which in turn may be identified in different ways. Another
+source of reference metadata is provided by structured data extraction from PDF
+with tools such as [GROBID](https://grobid.readthedocs.io). The raw reference data combined
+amounts to over 2B documents which we take as input for a batch process, that
+derives the graph structure.
+
+Two main modes of citation matching are employed: identifier based matching and
+fuzzy matching. Identifier based matching currently works with DOI, Arxiv ids,
+PMID and PMCID and ISBN. Fuzzy matching employs a scalable way to cluster
+documents (with pluggable clustering algorithms). For each cluster of match
+candidates we run a more extensive verification process, which yields a match
+confidence category, ranging from weak over strong to exact. Strong and exact
+matches are included in the graph.
+
+The current reference search index contains both matches and yet unmatched
+references. We expect this dataset to be iterated over regularly as there are
+a few dimensions along which the dataset can be improved and extended.
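The two matching modes described in the added guide text can be illustrated with a small sketch. This is not the fatcat implementation: the record layout, the `title_key` clustering function, the `verify` rules, and the `Status` enum are all hypothetical, and an in-memory dictionary stands in for the scalable batch clustering the text mentions. It only shows the overall shape: identifier lookups first, then clustering by a normalized key, then a verification step whose weak results are dropped.

```python
import re
from collections import defaultdict
from enum import IntEnum

# All names below are illustrative assumptions, not fatcat's actual code.
ID_FIELDS = ("doi", "arxiv", "pmid", "pmcid", "isbn")


class Status(IntEnum):
    """Match confidence categories, ordered from weakest to strongest."""
    WEAK = 1
    STRONG = 2
    EXACT = 3


def title_key(doc):
    """Clustering key (assumed): lowercased title, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def verify(ref, release):
    """Toy verification for two documents that share a cluster key."""
    same_title = (ref.get("title", "").strip().lower()
                  == release.get("title", "").strip().lower())
    same_year = (ref.get("year") is not None
                 and ref.get("year") == release.get("year"))
    if same_title and same_year:
        return Status.EXACT
    if same_year:
        return Status.STRONG
    return Status.WEAK


def match(refs, releases):
    """Return (ref_id, release_id, how) edges for the reference graph."""
    edges, matched = [], set()

    # Mode 1: identifier based matching.
    by_id = {}
    for rel in releases:
        for field in ID_FIELDS:
            if rel.get(field):
                by_id[(field, rel[field].lower())] = rel
    for ref in refs:
        for field in ID_FIELDS:
            hit = by_id.get((field, ref[field].lower())) if ref.get(field) else None
            if hit:
                edges.append((ref["id"], hit["id"], field))
                matched.add(ref["id"])
                break

    # Mode 2: fuzzy matching, clustering by key and then verifying candidates.
    clusters = defaultdict(list)
    for rel in releases:
        key = title_key(rel)
        if key:
            clusters[key].append(rel)
    for ref in refs:
        if ref["id"] in matched:
            continue
        for cand in clusters.get(title_key(ref), []):
            status = verify(ref, cand)
            if status >= Status.STRONG:  # weak matches stay out of the graph
                edges.append((ref["id"], cand["id"], status.name.lower()))
                matched.add(ref["id"])
                break
    return edges
```

References that end up in neither mode remain unmatched, which corresponds to the unmatched entries the guide text says are kept in the search index alongside the matches.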