-rw-r--r-- | proposals/2021_04_23_citation_graph_tooling.md | 36 |
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d12ce2a..d90981a 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -1,28 +1,48 @@
 # Building a Citation Graph
 
 * date: 2021-04-23
-* status: implemented
+* status: wip
 
 ## Problem and Goal
 
 We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library; additionally, we want to record web links as targets.
+wikipedia, open library and other sources; we also want to include web links.
 
 Citation indices and graphs can be traced back at least to the seminal paper
 *Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
 published in 2005 already lists 17 services that include cited reference
 search.
 
-Roughly three broad problems need to be solved:
+There are two main document types: a catalog record and an entry describing a
+citation. Both may contain only partial information.
 
-* A catalog (source) needs to be available, containing relatively clean
-  metadata on scholarly communication documents.
-* The reference data needs to be available, either in metadata directly or by
-  extraction from documents.
-* The datasets need to be compared.
+## The Funnel Approach
+
+To link a reference entry to a catalog record we use a funnel approach: we
+start with the most common (or easiest) pattern in the data, then iterate
+and look at harder or more obscure patterns.
+
+The simplest and most reliable form of linkage is by persistent identifier
+(PID) or per-source unique identifier (such as a PubMed ID). If no identifier
+is available, we fall back to a fuzzy matching and verification approach that
+implements data-specific rules for matching.
+
+## Implementation
+
+A goal is to start small, and eventually move to a canonical data framework
+for processing, if appropriate or necessary.
+
+In particular, we would like to make it fast to analyze a few billion
+reference entries in a reasonable amount of time, with little setup and
+intuitive command line tooling.
+
+We use a *map-reduce* approach: we derive a key from each document and pass
+the (key, document) tuples sharing a key to a reduce function, which performs
+additional computation, such as verification or reference schema generation
+(e.g. a JSON document representing an edge in the citation graph).
+
+This approach allows us to work with exact identifiers, as well as fuzzy
+matching over partial data.
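
The funnel described in the proposal could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: all field names (`doi`, `title`, `year`), the title normalization, and the year-based verification rule are hypothetical assumptions.

```python
def normalize_title(title):
    # Lowercase and strip non-alphanumeric characters for a forgiving key.
    return "".join(c for c in title.lower() if c.isalnum())

def verify(ref, record):
    # A data-specific verification rule (hypothetical): accept the
    # candidate if publication years are missing or close.
    ry, cy = ref.get("year"), record.get("year")
    return ry is None or cy is None or abs(ry - cy) <= 1

def match_reference(ref, catalog_by_doi, catalog_by_title):
    """Return a matching catalog record for a reference entry, or None."""
    # Stage 1 of the funnel: exact match via persistent identifier,
    # the most common and most reliable case.
    doi = (ref.get("doi") or "").strip().lower()
    if doi and doi in catalog_by_doi:
        return catalog_by_doi[doi]

    # Stage 2: fall back to fuzzy matching on a normalized title,
    # followed by verification against additional fields.
    candidate = catalog_by_title.get(normalize_title(ref.get("title", "")))
    if candidate and verify(ref, candidate):
        return candidate
    return None
```

Each later stage only sees entries the earlier, cheaper stages could not resolve, which is what makes the funnel shape useful at billion-entry scale.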
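
The map-reduce step might look like the following single-machine sketch, with the document shape (`type`, `doi`, `ident`, `from` fields) assumed purely for illustration: the map step derives a grouping key per document, and the reduce step sees all documents sharing a key and emits citation-graph edges as JSON documents.

```python
import itertools
import json

def key_fn(doc):
    # Map step: derive a join key from a document; here a lowercased
    # DOI, but it could equally be a normalized title or other
    # per-source identifier.
    return (doc.get("doi") or "").lower()

def reduce_group(key, docs):
    # Reduce step: pair catalog records with reference entries that
    # share the key, and emit one JSON edge per pair.
    records = [d for d in docs if d["type"] == "record"]
    refs = [d for d in docs if d["type"] == "ref"]
    for ref, rec in itertools.product(refs, records):
        yield json.dumps({"source": ref["from"], "target": rec["ident"], "key": key})

def run(documents):
    # Group by key (itertools.groupby requires sorted input); documents
    # without a usable key fall through to later, fuzzier stages.
    docs = sorted((d for d in documents if key_fn(d)), key=key_fn)
    for key, group in itertools.groupby(docs, key=key_fn):
        yield from reduce_group(key, list(group))
```

Because only `key_fn` changes between passes, the same sort-and-group skeleton serves both the exact-identifier stage and fuzzy stages keyed on normalized fields.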