update proposal

author: Martin Czygan <martin.czygan@gmail.com> 2021-04-24 00:48:56 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-04-24 00:48:56 +0200
commit: 1ab13de4c6fff5db11d8d39936b4513196a56cee (patch)
tree: 450b5a6fdb517f3bfa8a85132bda15f4d6678879 /proposals
parent: 1d4b8af2d107e5978f4d161f86233e5037174d21 (diff)
download: refcat-1ab13de4c6fff5db11d8d39936b4513196a56cee.tar.gz refcat-1ab13de4c6fff5db11d8d39936b4513196a56cee.zip
1 files changed, 13 insertions, 11 deletions
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d90981a..328dec7 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -6,17 +6,18 @@
 ## Problem and Goal
 
 We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library and other sources; we also want to include web links.
+open library, wikipedia and other sources; we also want to include archived web
+pages referenced in papers.
 
 Citations indices and graphs can be traced back at least to the seminal paper
 *Citation indexes for science* by Garfield, 1955 [1]. A anniversary paper [2]
 published in 2005 already lists 17 services that include cited reference
-search.
+search. Citation counts are common elements on scholarly search engine sites.
 
-There are two main document types: a catalog record and a entry describing a
-citation. Both can contain partial information only.
+We are working with two main document types: a catalog record and an entry
+describing a citation. Both can contain partial information only.
 
-## The Funnel Approach
+## A Funnel Approach
 
 To link a reference entry to a catalog record we use a funnel approach. That
 is, we start with the most common (or the easiest) pattern in the data, then
@@ -30,15 +31,15 @@ implements data specific rules for matching.
 ## Implementation
 
 A goal is to start small, and eventuelly move to a canonical data framework for
-processing, if approriate or necessary.
+processing, if appropriate or necessary [3].
 
 Especially we would like to make it fast to analyze a few billion reference
-entries in a reasonable amount of time with little setup and intuitive command
-line tooling.
+entries in a reasonable amount of time with little setup and minimal resource
+dependencies.
 
-We use a *map-reduce* approach, especially we derive a key from a document
-and pass the (key, document) tuples sharing a key to a reduce function, which
-performs additional computation, such as verification or reference schema
+We use a *map-reduce* like processing model. Especially we derive a key from a
+document and pass (key, document) tuples sharing a key to a reduce function,
+which performs additional computation, such as verification or reference schema
 generation (e.g. a JSON document representing an edge in the citation graph).
 
 This approach allows us to work with exact identifiers, as well as fuzzy
@@ -50,4 +51,5 @@ matching over partial data.
 
 * [1] http://garfield.library.upenn.edu/papers/science1955.pdf
 * [2] https://authors.library.caltech.edu/24838/1/ROTcs05.pdf
+* [3] As of 04/2021 the total input size is about 1.6TB uncompressed JSON documents.
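
The map-reduce like processing model described in the proposal (derive a key from each document, then pass the (key, document) tuples sharing a key to a reduce function) can be illustrated with a minimal in-memory sketch. This is not the refcat implementation: the function names, the field names (`id`, `source_id`, `doi`), the choice of a normalized DOI as key, and the edge layout are all assumptions made for illustration. At the input sizes mentioned in [3], the in-memory sort would presumably be replaced by an external sort over key-prefixed lines.

```python
import itertools
import json


def derive_key(doc):
    """Hypothetical key function: a lowercased, trimmed DOI, or None if absent."""
    doi = doc.get("doi")
    return doi.strip().lower() if doi else None


def map_docs(docs, source):
    """Map step: emit (key, source, document) tuples, skipping keyless documents."""
    for doc in docs:
        key = derive_key(doc)
        if key is not None:
            yield key, source, doc


def reduce_group(key, group):
    """Reduce step: pair each reference with each catalog record sharing the key."""
    records = [doc for _, source, doc in group if source == "catalog"]
    refs = [doc for _, source, doc in group if source == "ref"]
    for ref in refs:
        for record in records:
            # One edge of the citation graph (assumed layout).
            yield {"source": ref["source_id"], "target": record["id"], "match_key": key}


def link(catalog, refs):
    """Group catalog records and reference entries by key and reduce each group."""
    tuples = list(map_docs(catalog, "catalog")) + list(map_docs(refs, "ref"))
    # Sort on the key only, so documents themselves are never compared.
    tuples.sort(key=lambda t: t[0])
    for key, group in itertools.groupby(tuples, key=lambda t: t[0]):
        yield from reduce_group(key, list(group))


# Made-up example documents; both kinds may carry partial information only.
catalog = [
    {"id": "rec-a", "doi": "10.1234/abc"},
    {"id": "rec-b", "doi": "10.1234/xyz"},
]
refs = [
    {"source_id": "rec-b", "doi": "10.1234/ABC "},
    {"source_id": "rec-a", "title": "no identifier"},
]

edges = list(link(catalog, refs))
for edge in edges:
    print(json.dumps(edge))
```

Swapping `derive_key` for a different function, for example a normalized title, would turn the same pipeline into a later and fuzzier funnel stage, with the reduce function taking over the verification of candidate pairs.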