wip: update proposal

author: Martin Czygan <martin.czygan@gmail.com> 2021-04-23 12:10:43 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-04-23 12:10:43 +0200
commit: 1d4b8af2d107e5978f4d161f86233e5037174d21 (patch)
tree: 009f60fb02f3804afc9895674d38a070566606f2 /proposals
parent: fd8960c6942533867b8a05cdeb8eee55ac21860c (diff)
download: refcat-1d4b8af2d107e5978f4d161f86233e5037174d21.tar.gz
refcat-1d4b8af2d107e5978f4d161f86233e5037174d21.zip
1 files changed, 28 insertions, 8 deletions
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d12ce2a..d90981a 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -1,28 +1,48 @@
 # Building a Citation Graph
 
 * date: 2021-04-23
-* status: implemented
+* status: wip
 
 ## Problem and Goal
 
 We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library; additionally, we want to record web links as targets.
+wikipedia, open library and other sources; we also want to include web links.
 
 Citation indices and graphs can be traced back at least to the seminal paper
 *Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
 published in 2005 already lists 17 services that include cited reference
 search.
 
-Roughly three broad problems need to be solved:
+There are two main document types: a catalog record and an entry describing a
+citation. Both may contain only partial information.
 
-* A catalog (source) needs to be available, containing relatively clean
-  metadata on scholarly communication documents.
-* The reference data needs to be available, either in metadata directly or by
-  extraction from documents.
-* The datasets need to be compared.
+## The Funnel Approach
 
+To link a reference entry to a catalog record we use a funnel approach. That
+is, we start with the most common (or the easiest) pattern in the data, then
+iterate and look at harder or more obscure patterns.
 
+The simplest and most reliable way of linkage is by persistent identifier (PID)
+or per-source unique identifier (such as PubMed ID). If no identifier is
+available, we fall back to a fuzzy matching and verification approach that
+implements data-specific rules for matching.
 
+## Implementation
+
+A goal is to start small and eventually move to a canonical data framework for
+processing, if appropriate or necessary.
+
+In particular, we would like to analyze a few billion reference entries in a
+reasonable amount of time, with little setup and intuitive command line
+tooling.
+
+We use a *map-reduce* approach: we derive a key from each document and pass
+the (key, document) tuples sharing a key to a reduce function, which
+performs additional computation, such as verification or reference schema
+generation (e.g. a JSON document representing an edge in the citation graph).
+
+This approach allows us to work with exact identifiers, as well as fuzzy
+matching over partial data.
 
 ----
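
The funnel and map-reduce steps described in the proposal can be illustrated with a minimal sketch. It is not the refcat implementation: the field names (`id`, `ref_id`, `source_id`, `doi`, `title`), the function names, the sample DOI, and the title normalization used as a stand-in for fuzzy matching are all assumptions made for illustration. An in-memory sort replaces whatever external sort a real pipeline over billions of entries would need, and the verification step mentioned in the proposal is omitted.

```python
import itertools
import re


def key_by_doi(doc):
    """Exact stage: key on a lowercased DOI (hypothetical field name)."""
    doi = doc.get("doi")
    return doi.strip().lower() if doi else None


def key_by_title(doc):
    """Fuzzy stage stand-in: key on the title reduced to [a-z0-9]."""
    title = re.sub(r"[^a-z0-9]", "", (doc.get("title") or "").lower())
    return title or None


def map_step(docs, kind, key_func):
    """Emit (key, kind, document) tuples; skip documents without a key."""
    for doc in docs:
        key = key_func(doc)
        if key is not None:
            yield key, kind, doc


def reduce_step(group, match_type):
    """Pair each reference with each catalog record sharing the key."""
    records = [doc for _, kind, doc in group if kind == "catalog"]
    refs = [doc for _, kind, doc in group if kind == "ref"]
    for ref in refs:
        for record in records:
            yield {
                "ref_id": ref["ref_id"],
                "source": ref["source_id"],
                "target": record["id"],
                "match": match_type,
            }


def run_funnel(catalog, refs, stages):
    """Run stages in order; only unmatched references reach the next stage."""
    edges, remaining = [], list(refs)
    for match_type, key_func in stages:
        tuples = list(map_step(catalog, "catalog", key_func))
        tuples += list(map_step(remaining, "ref", key_func))
        tuples.sort(key=lambda t: t[0])  # in-memory stand-in for a real sort
        found = []
        for _, group in itertools.groupby(tuples, key=lambda t: t[0]):
            found.extend(reduce_step(list(group), match_type))
        edges.extend(found)
        matched = {edge["ref_id"] for edge in found}
        remaining = [r for r in remaining if r["ref_id"] not in matched]
    return edges, remaining


# Made-up sample data; the DOI is a placeholder, not a real identifier.
catalog = [
    {"id": "work-1", "doi": "10.1000/ABC", "title": "Citation Indexes for Science"},
    {"id": "work-2", "title": "Building a Citation Graph"},
]
refs = [
    {"ref_id": "r1", "source_id": "work-2", "doi": "10.1000/abc"},
    {"ref_id": "r2", "source_id": "work-1", "title": "Building a citation graph."},
    {"ref_id": "r3", "source_id": "work-2", "title": "Unmatched entry"},
]
stages = [("doi", key_by_doi), ("title", key_by_title)]
edges, remaining = run_funnel(catalog, refs, stages)
```

In this sketch, each stage is one map-reduce pass with its own key function, so exact identifiers and looser keys share the same machinery. Each edge is a plain dict that could be serialized as one JSON document. A title-only key will produce false positives on real data, which is where the verification step from the proposal would fit inside `reduce_step`.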