author Martin Czygan <martin.czygan@gmail.com> 2021-04-24 00:48:56 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2021-04-24 00:48:56 +0200
commit 1ab13de4c6fff5db11d8d39936b4513196a56cee
tree 450b5a6fdb517f3bfa8a85132bda15f4d6678879
parent 1d4b8af2d107e5978f4d161f86233e5037174d21
update proposal
-rw-r--r-- proposals/2021_04_23_citation_graph_tooling.md | 24
1 files changed, 13 insertions, 11 deletions
diff --git a/proposals/2021_04_23_citation_graph_tooling.md b/proposals/2021_04_23_citation_graph_tooling.md
index d90981a..328dec7 100644
--- a/proposals/2021_04_23_citation_graph_tooling.md
+++ b/proposals/2021_04_23_citation_graph_tooling.md
@@ -6,17 +6,18 @@
## Problem and Goal
We want to generate a citation graph including bibliographic data from fatcat,
-wikipedia, open library and other sources; we also want to include web links.
+open library, wikipedia and other sources; we also want to include archived web
+pages referenced in papers.
Citation indices and graphs can be traced back at least to the seminal paper
*Citation indexes for science* by Garfield, 1955 [1]. An anniversary paper [2]
published in 2005 already lists 17 services that include cited reference
-search.
+search. Citation counts are common elements on scholarly search engine sites.
-There are two main document types: a catalog record and an entry describing a
-citation. Both can contain partial information only.
+We are working with two main document types: a catalog record and an entry
+describing a citation. Both can contain partial information only.
-## The Funnel Approach
+## A Funnel Approach
To link a reference entry to a catalog record we use a funnel approach. That
is, we start with the most common (or the easiest) pattern in the data, then
@@ -30,15 +31,15 @@ implements data specific rules for matching.
## Implementation
A goal is to start small, and eventually move to a canonical data framework for
-processing, if appropriate or necessary.
+processing, if appropriate or necessary [3].
In particular, we would like to make it fast to analyze a few billion reference
-entries in a reasonable amount of time with little setup and intuitive command
-line tooling.
+entries in a reasonable amount of time with little setup and minimal resource
+dependencies.
-We use a *map-reduce* approach; in particular, we derive a key from a document
-and pass the (key, document) tuples sharing a key to a reduce function, which
-performs additional computation, such as verification or reference schema
+We use a *map-reduce*-like processing model. In particular, we derive a key from a
+document and pass (key, document) tuples sharing a key to a reduce function,
+which performs additional computation, such as verification or reference schema
generation (e.g. a JSON document representing an edge in the citation graph).
This approach allows us to work with exact identifiers, as well as fuzzy
@@ -50,4 +51,5 @@ matching over partial data.
* [1] http://garfield.library.upenn.edu/papers/science1955.pdf
* [2] https://authors.library.caltech.edu/24838/1/ROTcs05.pdf
+* [3] As of 04/2021 the total input size is about 1.6TB of uncompressed JSON documents.
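
The *map-reduce*-like model described in the Implementation section of the patched proposal can be illustrated with a small sketch. This is not the project's code: the field names (`kind`, `id`, `source_id`, `doi`), the key function `derive_key`, the reducer `reduce_group` and the edge layout are all hypothetical, chosen only to show the flow of deriving a key, grouping (key, document) tuples that share a key, and emitting one JSON-serializable edge per match. It sorts in memory, which would not be practical at the input size mentioned in [3]; an external sort is one possible substitute.

```python
import itertools
import json


def derive_key(doc):
    """Hypothetical key function: a normalized DOI, or None if the document has none."""
    doi = doc.get("doi")
    return doi.strip().lower() if doi else None


def reduce_group(key, docs):
    """Hypothetical reducer: link each reference entry to each catalog record in the group."""
    catalog = [d for d in docs if d.get("kind") == "catalog"]
    refs = [d for d in docs if d.get("kind") == "ref"]
    for ref in refs:
        for rec in catalog:
            # One edge of the citation graph; the schema is an assumption.
            yield {"source": ref["source_id"], "target": rec["id"], "key": key}


def run(docs):
    """Map documents to (key, doc) tuples, group by key, and reduce each group."""
    keyed = [(derive_key(d), d) for d in docs]
    keyed = [(k, d) for k, d in keyed if k is not None]  # partial data: skip docs without a key
    keyed.sort(key=lambda kd: kd[0])  # groupby needs equal keys to be adjacent
    edges = []
    for key, group in itertools.groupby(keyed, key=lambda kd: kd[0]):
        edges.extend(reduce_group(key, [d for _, d in group]))
    return edges


# Made-up example documents, not real data.
EXAMPLE_DOCS = [
    {"kind": "catalog", "id": "release_a", "doi": "10.1000/ABC"},
    {"kind": "ref", "source_id": "release_b", "doi": "10.1000/abc "},
    {"kind": "ref", "source_id": "release_c", "doi": None},
    {"kind": "catalog", "id": "release_d", "doi": "10.1000/xyz"},
]

if __name__ == "__main__":
    for edge in run(EXAMPLE_DOCS):
        print(json.dumps(edge))
```

Swapping `derive_key` for a different key (for example a normalized title) would give the fuzzier stages of the funnel, with `reduce_group` then responsible for verifying candidate pairs before emitting an edge.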