-rw-r--r-- | skate/README.md | 24 |
1 files changed, 22 insertions, 2 deletions
diff --git a/skate/README.md b/skate/README.md
index f3a4463..a63ce18 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -19,8 +19,8 @@ project for performance (and we saw a 25x speedup for certain tasks).
 ## Overview
 
 We follow a map-reduce style approach (on a single machine): We extract
-specific keys from data. We group items with the same *key* together and apply
-some computation on these groups.
+specific keys from data. We group items (via sort) with the same *key* together
+and apply some computation on these groups.
 
 Mapper is defined as function type, mapping a blob of data (e.g. a single JSON
 object) to a number of fields (e.g. key, value).
@@ -70,3 +70,23 @@ a bit generic in subpackage [zipkey](zipkey).
 
 Scale: few millions to up to few billions of docs
 
+## TODO and Issues
+
++ [ ] a clearer way to deduplicate edges
+
+Currently, we use the `source_release_ident` and `ref_index` as an
+elasticsearch key. This means that reference docs coming from different
+sources (e.g. crossref, grobid, etc.) but representing the same item must match
+in their index, which we can neither guarantee nor make robust across various
+sources, where indices may change frequently.
+
+A clearer path could be:
+
+1. group all matches (and non-matches) by source work id (already the case)
+2. generate a list of unique refs, unique by source-target, w/o any index
+3. for any source-target with more than one occurrence, understand which one of these we want to include
+
+Step 3 can now go into all depth understanding multiplicities, e.g. is it an
+"ebd", "ff" type? Does it come from different sources (then e.g. choose the one
+most likely to be correct, etc.), ...
+
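The three steps proposed in the TODO could be sketched roughly as follows. This is a minimal illustration and not code from skate: the `Ref` type, its field names, `Dedup` and `PreferProvider` are all hypothetical, and the "prefer one provider" policy is only a stand-in for the open question in step 3.

```go
package main

import (
	"fmt"
	"sort"
)

// Ref is a hypothetical, simplified edge between two works.
// Field names are assumptions for this sketch, not the skate schema.
type Ref struct {
	Source   string // source work id
	Target   string // target work id
	Provider string // where the ref came from, e.g. "crossref", "grobid"
	Index    int    // position in the reference list; ignored for uniqueness
}

// edgeKey identifies an edge by source and target only, without any index.
type edgeKey struct{ Source, Target string }

// Dedup groups refs by (source, target) and keeps one ref per pair.
// choose is only called for pairs that occur more than once (step 3).
func Dedup(refs []Ref, choose func([]Ref) Ref) []Ref {
	groups := make(map[edgeKey][]Ref)
	for _, r := range refs {
		k := edgeKey{r.Source, r.Target}
		groups[k] = append(groups[k], r)
	}
	out := make([]Ref, 0, len(groups))
	for _, g := range groups {
		if len(g) == 1 {
			out = append(out, g[0])
		} else {
			out = append(out, choose(g))
		}
	}
	// Map iteration order is random, so sort for a stable result.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Source != out[j].Source {
			return out[i].Source < out[j].Source
		}
		return out[i].Target < out[j].Target
	})
	return out
}

// PreferProvider is a placeholder policy: take the first ref from the named
// provider, or fall back to the first ref in the group.
func PreferProvider(name string) func([]Ref) Ref {
	return func(group []Ref) Ref {
		for _, r := range group {
			if r.Provider == name {
				return r
			}
		}
		return group[0]
	}
}

func main() {
	refs := []Ref{
		{Source: "w1", Target: "w2", Provider: "grobid", Index: 3},
		{Source: "w1", Target: "w2", Provider: "crossref", Index: 5},
		{Source: "w1", Target: "w3", Provider: "grobid", Index: 4},
	}
	for _, r := range Dedup(refs, PreferProvider("crossref")) {
		fmt.Printf("%s -> %s (%s)\n", r.Source, r.Target, r.Provider)
	}
}
```

Keeping the selection policy as a function argument leaves room for the finer distinctions mentioned in step 3 (match type, provider reliability) without touching the grouping logic.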