-rw-r--r-- skate/README.md 24
1 files changed, 22 insertions, 2 deletions
diff --git a/skate/README.md b/skate/README.md
index f3a4463..a63ce18 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -19,8 +19,8 @@ project for performance (and we saw a 25x speedup for certain tasks).
## Overview
We follow a map-reduce style approach (on a single machine): We extract
-specific keys from data. We group items with the same *key* together and apply
-some computation on these groups.
+specific keys from data. We group items (via sort) with the same *key* together
+and apply some computation on these groups.
Mapper is defined as function type, mapping a blob of data (e.g. a single JSON
object) to a number of fields (e.g. key, value).
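
The overview above can be illustrated with a minimal in-memory sketch. It is not the project's implementation: `mapper`, `map_sort_reduce` and the `key`/`value` field names are hypothetical, and sorting in memory is an assumption that only holds for small inputs.

```python
import json
from itertools import groupby
from operator import itemgetter


def mapper(blob):
    """Hypothetical mapper: turn one JSON document into a (key, value) pair."""
    doc = json.loads(blob)
    return doc["key"], doc["value"]


def map_sort_reduce(blobs, mapper, reducer):
    """Extract keys, group items with the same key via sort, reduce each group."""
    # Sorting by key puts equal keys next to each other, so groupby sees
    # each key exactly once.
    pairs = sorted((mapper(blob) for blob in blobs), key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, reducer([value for _, value in group])


blobs = [
    '{"key": "b", "value": 1}',
    '{"key": "a", "value": 2}',
    '{"key": "b", "value": 3}',
]
result = dict(map_sort_reduce(blobs, mapper, sum))
```

Here the per-group computation is simply `sum`; any function that takes the list of values for one key would work in its place.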
@@ -70,3 +70,23 @@ a bit generic in subpackage [zipkey](zipkey).
Scale: few millions to up to few billions of docs
+## TODO and Issues
+
++ [ ] a clearer way to deduplicate edges
+
+Currently, we use the `source_release_ident` and `ref_index` as an
+Elasticsearch key. This means that reference docs coming from different
+sources (e.g. crossref, grobid, etc.) but representing the same item must match
+in their index, which we can neither guarantee nor expect to be robust across
+sources, where indices may change frequently.
+
+A clearer path could be:
+
+1. group all matches (and non-matches) by source work id (already the case)
+2. generate a list of unique refs, unique by source-target, w/o any index
+3. for any source-target with more than one occurrence, decide which one of these we want to include
+
+Step 3 can then go into as much depth as needed to understand multiplicities,
+e.g. is it an "ebd" or "ff" type? Does it come from different sources (then
+choose the one most likely to be correct, etc.), ...
+
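
The three steps proposed in this TODO could be sketched roughly as follows. This is only an illustration under stated assumptions: `source_release_ident` and `ref_index` come from the text, while `target_release_ident`, `provider`, `PRIORITY`, `dedupe_edges` and `prefer_by_provider` are hypothetical names. The sketch also assumes that non-matches (refs without a target) are kept as they are, and the selection policy is a placeholder for whatever step 3 ends up deciding.

```python
from collections import defaultdict

# Hypothetical preference order between sources; lower is preferred.
PRIORITY = {"crossref": 0, "grobid": 1}


def prefer_by_provider(candidates):
    """Placeholder policy for step 3: pick the ref from the preferred source."""
    return min(candidates, key=lambda r: PRIORITY.get(r.get("provider"), len(PRIORITY)))


def dedupe_edges(refs, choose):
    """Return refs unique by (source, target), ignoring ref_index."""
    groups = defaultdict(list)
    unmatched = []
    for ref in refs:
        target = ref.get("target_release_ident")
        if target is None:
            # Assumption: non-matches are not collapsed into one another.
            unmatched.append(ref)
            continue
        groups[(ref["source_release_ident"], target)].append(ref)
    # Only pairs with more than one occurrence need a decision.
    unique = [g[0] if len(g) == 1 else choose(g) for g in groups.values()]
    return unique + unmatched


refs = [
    {"source_release_ident": "s1", "target_release_ident": "t1", "ref_index": 3, "provider": "grobid"},
    {"source_release_ident": "s1", "target_release_ident": "t1", "ref_index": 4, "provider": "crossref"},
    {"source_release_ident": "s1", "target_release_ident": "t2", "ref_index": 5, "provider": "grobid"},
    {"source_release_ident": "s1", "target_release_ident": None, "ref_index": 6, "provider": "grobid"},
]
unique = dedupe_edges(refs, prefer_by_provider)
```

Passing the policy in as `choose` keeps the grouping independent of the open question in step 3, so a more detailed rule (e.g. one that looks at match types) could be swapped in without touching the rest.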