From e5b01062cec62216fb7c4f0806f2d997f70097f8 Mon Sep 17 00:00:00 2001
From: Martin Czygan <martin.czygan@gmail.com>
Date: Tue, 27 Jul 2021 12:12:04 +0200
Subject: update todo notes

---
 skate/README.md | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/skate/README.md b/skate/README.md
index f3a4463..a63ce18 100644
--- a/skate/README.md
+++ b/skate/README.md
@@ -19,8 +19,8 @@ project for performance (and we saw a 25x speedup for certain tasks).
 ## Overview
 
 We follow a map-reduce style approach (on a single machine): We extract
-specific keys from data. We group items with the same *key* together and apply
-some computation on these groups.
+specific keys from data. We group items (via sort) with the same *key* together
+and apply some computation on these groups.
 
 Mapper is defined as function type, mapping a blob of data (e.g. a single JSON
 object) to a number of fields (e.g. key, value).
@@ -70,3 +70,23 @@ a bit generic in subpackage [zipkey](zipkey).
 
 Scale: few millions to up to few billions of docs
 
+## TODO and Issues
+
++ [ ] a clearer way to deduplicate edges
+
+Currently, we use the `source_release_ident` and `ref_index` as an
+elasticsearch key. This means that docs coming from different sources (e.g.
+crossref, grobid, etc.) but representing the same item must match in their
+index, which we can neither guarantee nor make robust across the various
+sources, where indices may change frequently.
+
+A clearer path could be:
+
+1. group all matches (and non-matches) by source work id (already the case)
+2. generate a list of unique refs, unique by source-target, w/o any index
+3. for any source-target with more than one occurrence, understand which one of these we want to include
+
+Step 3 can then go into as much depth as needed to understand multiplicities,
+e.g. is it an "ebd" or "ff" type? Does it come from different sources (if so,
+choose the one most likely to be correct), etc.
+
-- 
cgit v1.2.3