authorMartin Czygan <martin.czygan@gmail.com>2021-04-23 12:10:43 +0200
committerMartin Czygan <martin.czygan@gmail.com>2021-04-23 12:10:43 +0200
commit1d4b8af2d107e5978f4d161f86233e5037174d21 (patch)
tree009f60fb02f3804afc9895674d38a070566606f2
parentfd8960c6942533867b8a05cdeb8eee55ac21860c (diff)
downloadrefcat-1d4b8af2d107e5978f4d161f86233e5037174d21.tar.gz
refcat-1d4b8af2d107e5978f4d161f86233e5037174d21.zip
wip: update proposal
-rw-r--r--proposals/2021_04_23_citation_graph_tooling.md36
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d12ce2a..d90981a 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -1,28 +1,48 @@
# Building a Citation Graph
* date: 2021-04-23
-* status: implemented
+* status: wip
## Problem and Goal
We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library; additionally, we want to record web links as targets.
+wikipedia, open library and other sources; we also want to include web links.
Citation indices and graphs can be traced back at least to the seminal paper
*Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
published in 2005 already lists 17 services that include cited reference
search.
-Roughly three broad problems need to be solved:
+There are two main document types: a catalog record and an entry describing a
+citation. Either may contain only partial information.
-* A catalog (source) needs to be available, containing relatively clean
- metadata on scholarly communication documents.
-* The reference data needs to be available, either in metadata directly or by
- extraction from documents.
-* The datasets need to be compared.
+## The Funnel Approach
+To link a reference entry to a catalog record we use a funnel approach. That
+is, we start with the most common (or the easiest) pattern in the data, then
+iterate and look at harder or more obscure patterns.
+The simplest and most reliable form of linkage is via a persistent identifier
+(PID) or a per-source unique identifier (such as a PubMed ID). If no
+identifier is available, we fall back to a fuzzy matching and verification
+approach, which implements data-specific rules for matching.
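The funnel described above could be sketched as follows. This is an illustrative sketch only, not the refcat implementation; the field names (`doi`, `pmid`, `title`, `year`), the lookup tables, and the `verify` rule are all assumptions made for the example.

```python
# Hypothetical funnel: try exact identifiers first, fall back to fuzzy
# matching plus verification. Field names are illustrative, not the
# actual refcat schema.

def normalize_title(title):
    # Crude normalization: lowercase and keep alphanumerics only.
    return "".join(c for c in title.lower() if c.isalnum())

def verify(ref, record):
    # Placeholder for data-specific verification rules, e.g. comparing
    # publication years or author surnames.
    return ref.get("year") == record.get("year")

def match_reference(ref, catalog_by_doi, catalog_by_pmid, catalog_by_title):
    """Return a (record, method) pair, or (None, None) if no match."""
    # 1. Most reliable: persistent identifier (PID), here a DOI.
    doi = (ref.get("doi") or "").strip().lower()
    if doi and doi in catalog_by_doi:
        return catalog_by_doi[doi], "exact-doi"
    # 2. Per-source unique identifier, e.g. a PubMed ID.
    pmid = ref.get("pmid")
    if pmid and pmid in catalog_by_pmid:
        return catalog_by_pmid[pmid], "exact-pmid"
    # 3. Fallback: fuzzy match on a normalized title, then verify.
    key = normalize_title(ref.get("title", ""))
    for candidate in catalog_by_title.get(key, []):
        if verify(ref, candidate):
            return candidate, "fuzzy"
    return None, None
```

Each stage only runs if the previous, more reliable one failed, which mirrors the funnel: cheap and exact first, expensive and fuzzy last.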
+## Implementation
+
+A goal is to start small, and eventually move to a canonical data framework
+for processing, if appropriate or necessary.
+
+In particular, we would like to be able to analyze a few billion reference
+entries in a reasonable amount of time, with little setup and intuitive
+command line tooling.
+
+We use a *map-reduce* approach: we derive a key from each document and pass
+the (key, document) tuples sharing a key to a reduce function, which performs
+additional computation, such as verification or reference schema generation
+(e.g. a JSON document representing an edge in the citation graph).
+
+This approach allows us to work with exact identifiers, as well as fuzzy
+matching over partial data.
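The map-reduce step above could look like the following single-machine sketch. The key function, the `type`/`id` fields, and the edge shape are assumptions for illustration; a real pipeline over billions of entries would sort and shard the input first.

```python
# Minimal map-reduce sketch: derive a key per document, group documents
# sharing a key, run a reduce function over each group.
import itertools

def map_key(doc):
    # Example key: a lowercased DOI if present, else a normalized title.
    # Both choices are assumptions for this sketch.
    if doc.get("doi"):
        return ("doi", doc["doi"].lower())
    title = doc.get("title", "")
    return ("title", "".join(c for c in title.lower() if c.isalnum()))

def map_reduce(docs, reduce_fn):
    # Sort by key so itertools.groupby sees each key as one run.
    keyed = sorted(((map_key(d), d) for d in docs), key=lambda kv: kv[0])
    results = []
    for key, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        results.append(reduce_fn(key, [d for _, d in group]))
    return results

def make_edges(key, docs):
    # Example reduce: emit a JSON-ready edge whenever a reference entry
    # and a catalog record share a key.
    refs = [d for d in docs if d.get("type") == "ref"]
    recs = [d for d in docs if d.get("type") == "record"]
    return [{"source": r["id"], "target": c["id"], "match_key": key}
            for r in refs for c in recs]
```

Because the key function decides what ends up in the same group, the same skeleton covers both exact identifiers (DOI keys) and fuzzy matching (normalized-title keys with verification inside the reduce step).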
----