Diffstat (limited to 'proposals')
-rw-r--r-- | proposals/2021_04_23_citation_graph_tooling.md | 24 |
1 file changed, 13 insertions, 11 deletions
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d90981a..328dec7 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -6,17 +6,18 @@
 ## Problem and Goal

 We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library and other sources; we also want to include web links.
+open library, wikipedia and other sources; we also want to include archived web
+pages referenced in papers.

 Citation indices and graphs can be traced back at least to the seminal paper
 *Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
 published in 2005 already lists 17 services that include cited reference
-search.
+search. Citation counts are a common feature of scholarly search engines.

-There are two main document types: a catalog record and a entry describing a
-citation. Both can contain partial information only.
+We are working with two main document types: a catalog record and an entry
+describing a citation. Both can contain only partial information.

-## The Funnel Approach
+## A Funnel Approach

 To link a reference entry to a catalog record we use a funnel approach. That
 is, we start with the most common (or the easiest) pattern in the data, then
@@ -30,15 +31,15 @@ implements data specific rules for matching.
 ## Implementation

 A goal is to start small, and eventually move to a canonical data framework for
-processing, if approriate or necessary.
+processing, if appropriate or necessary [3].

 In particular, we would like to analyze a few billion reference
-entries in a reasonable amount of time with little setup and intuitive command
-line tooling.
+entries in a reasonable amount of time with little setup and minimal resource
+dependencies.

-We use a *map-reduce* approach, especially we derive a key from a document
-and pass the (key, document) tuples sharing a key to a reduce function, which
-performs additional computation, such as verification or reference schema
+We use a *map-reduce* style processing model: we derive a key from each
+document and pass (key, document) tuples sharing a key to a reduce function,
+which performs additional computation, such as verification or reference schema
 generation (e.g. a JSON document representing an edge in the citation graph).

 This approach allows us to work with exact identifiers, as well as fuzzy
@@ -50,4 +51,5 @@ matching over partial data.

 * [1] http://garfield.library.upenn.edu/papers/science1955.pdf
 * [2] https://authors.library.caltech.edu/24838/1/ROTcs05.pdf
+* [3] As of 04/2021, the total input size is about 1.6 TB of uncompressed JSON documents.
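
The funnel approach in the diff above amounts to a cascade of matching rules, tried from most common (or easiest) to fuzziest. Below is a minimal sketch of that idea, not part of the commit: the proposal does not prescribe a language, so Python is used for illustration, and the matchers `exact_doi` and `fuzzy_title` as well as the field names `doi` and `title` are hypothetical.

```python
def exact_doi(doc):
    """Hypothetical matcher: exact identifier match, the most common pattern."""
    doi = doc.get("doi")
    return ("doi", doi.lower()) if doi else None

def fuzzy_title(doc):
    """Hypothetical matcher: whitespace-normalized title, for records without a DOI."""
    title = doc.get("title")
    return ("title", " ".join(title.lower().split())) if title else None

# The funnel: easiest pattern first, fuzzier data-specific rules later.
STRATEGIES = [exact_doi, fuzzy_title]

def derive_key(doc):
    """Return the first key any strategy yields; documents that fall through
    every rule are left for a later iteration of the funnel."""
    for strategy in STRATEGIES:
        key = strategy(doc)
        if key:
            return key
    return None
```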
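The map-reduce style model from the implementation section can likewise be sketched in a few lines: derive a key per document, bring tuples sharing a key together, and hand each group to a reduce function that emits edge documents. Again this is an illustrative sketch, not the commit's code; the fields `doi` and `ident`, the helper `reduce_group`, and the edge schema are all assumptions, and a key function such as `derive_key` above could be plugged in instead.

```python
import itertools
import json
import sys

def key_doi(doc):
    """Hypothetical map function: derive a grouping key from a document."""
    doi = doc.get("doi")
    return doi.lower() if doi else None

def reduce_group(key, docs):
    """Hypothetical reduce function: emit a JSON edge document for each
    pair of documents that share a key (verification would go here)."""
    for a, b in itertools.combinations(docs, 2):
        yield {"source": a.get("ident"), "target": b.get("ident"), "match_key": key}

if __name__ == "__main__":
    # Map: derive (key, document) tuples from newline-delimited JSON on stdin.
    pairs = []
    for line in sys.stdin:
        doc = json.loads(line)
        key = key_doi(doc)
        if key:
            pairs.append((key, doc))

    # Shuffle: bring tuples sharing a key together via sorting.
    pairs.sort(key=lambda kv: kv[0])

    # Reduce: run additional computation per key group, e.g. edge generation.
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        for edge in reduce_group(key, [doc for _, doc in group]):
            print(json.dumps(edge))
```

At the input sizes mentioned in [3], the in-memory sort would presumably be replaced by an external sort (e.g. GNU sort over the key column), so the reduce step can stream over groups with little setup and minimal resource dependencies, as the proposal intends.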