From 9aeacc07be8151a0d44d25cbe377c9f4a09a620a Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 22 Oct 2020 11:28:57 +0200
Subject: update notes on clustering

---
 notes/Clustering.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/notes/Clustering.md b/notes/Clustering.md
index d390035..d794bdc 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -36,3 +36,21 @@ Numbers of clusters:
 
 * [ ] do a SS like clustering, using title and author ngrams
 * [ ] cluster by doi without "vX" suffix
+
+# Verification
+
+* we only need to look at identified duplicates, which will be a few million
+* we want fast access to all release JSON blobs via ident, maybe do a
+  "fuzzycat-cache" that copies relevant files into the fs, e.g.
+  "~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json", or via microblob (although we do not need HTTP), or sqlite3 (https://www.sqlite.org/fasterthanfs.html)
+
+For verification we need to have the cached JSON blobs in some fast,
+thread-safe store. Estimated: at 1K accesses/s, we would still need a few
+hours for a run.
+
+* [ ] find all ids we need, generate cache, maybe reduce number of fields
+* [ ] run verification on each cluster; generate a file in the same format of
+  "verified" clusters; take note of the clustering and verification method
+
+Overall, we can combine various clustering and verification methods. We can
+also put together a list of maybe 100-200 test cases and evaluate methods.
-- 
cgit v1.2.3
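
The sqlite3 option from the Verification notes above could look roughly like the following. This is a minimal sketch, not code from fuzzycat: the class name `ReleaseCache`, the table layout, and the `KEEP_FIELDS` subset are all assumptions made for illustration. It stores one reduced JSON document per release ident and looks it up by primary key. Thread safety is not addressed here; one connection per thread would be one way to handle it.

```python
import json
import sqlite3

# Assumed subset of release fields to keep; the notes do not name any.
KEEP_FIELDS = ("ident", "title", "contribs", "ext_ids")


class ReleaseCache:
    """Hypothetical sqlite3-backed cache mapping release ident -> JSON doc."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS release "
            "(ident TEXT PRIMARY KEY, doc TEXT NOT NULL)"
        )

    def put_many(self, releases):
        """Insert release dicts, keeping only the fields in KEEP_FIELDS."""
        rows = (
            (r["ident"], json.dumps({k: r[k] for k in KEEP_FIELDS if k in r}))
            for r in releases
        )
        with self.conn:  # commit once for the whole batch
            self.conn.executemany(
                "INSERT OR REPLACE INTO release (ident, doc) VALUES (?, ?)", rows
            )

    def get(self, ident):
        """Return the cached release dict, or None if the ident is unknown."""
        row = self.conn.execute(
            "SELECT doc FROM release WHERE ident = ?", (ident,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

A verification run would first call `put_many` with every release referenced by a cluster, then call `get(ident)` while comparing cluster members. Passing a file path instead of `":memory:"` keeps the cache on disk between runs.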