update notes on clustering

author: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 11:28:57 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 11:28:57 +0200
commit: 9aeacc07be8151a0d44d25cbe377c9f4a09a620a (patch)
tree: 09343be2bf28c6155b0a84440d0b8b9b42b3c598
parent: f01429bbc70cf8ed8cbc114956cd37236e65fd4a (diff)
download: fuzzycat-9aeacc07be8151a0d44d25cbe377c9f4a09a620a.tar.gz
fuzzycat-9aeacc07be8151a0d44d25cbe377c9f4a09a620a.zip
1 files changed, 18 insertions, 0 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index d390035..d794bdc 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -36,3 +36,21 @@ Numbers of clusters:
 
 * [ ] do a SS like clustering, using title and author ngrams
 * [ ] cluster by doi without "vX" suffix
+
+# Verification
+
+* we only need to look at identified duplicates, which will be a few million
+* we want fast access to every release JSON blob via ident; maybe do a
+  "fuzzycat-cache" that copies relevant files into the fs, e.g.
+  "~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json", or use microblob (we do not need http, though), or sqlite3 (https://www.sqlite.org/fasterthanfs.html)
+
+For verification we need to have the cached JSON blobs in some fast,
+thread-safe store. Even at an estimated 1K accesses/s, we would still need a
+few hours for a run.
+
+* [ ] find all ids we need, generate cache, maybe reduce number of fields
+* [ ] run verification on each cluster; generate a file in the same format with
+  the "verified" clusters; take note of the clustering and verification method
+
+Overall, we can combine various clustering and verification methods. We can
+also put together a list of maybe 100-200 test cases and evaluate methods.
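
A minimal sketch of the two cache options mentioned in the notes, sharded files on disk and a sqlite3 store keyed by release ident. This is not what fuzzycat implements. The names `shard_path` and `ReleaseCache`, the table layout, and the use of a SHA-1 digest for the two-character directory prefix are assumptions for illustration. It also assumes each release document carries an `ident` key.

```python
import hashlib
import json
import os
import sqlite3


def shard_path(ident, root="~/.cache/fuzzycat/releases"):
    """Hypothetical file layout: hash the ident and use the first two hex
    characters as a subdirectory, so no single directory grows too large."""
    digest = hashlib.sha1(ident.encode("utf-8")).hexdigest()
    return os.path.join(os.path.expanduser(root), digest[:2], digest[2:] + ".json")


class ReleaseCache:
    """Hypothetical sqlite3-backed store mapping release ident to JSON."""

    def __init__(self, path=":memory:"):
        # Assumption: each worker thread opens its own ReleaseCache, which
        # sidesteps sharing one connection across threads.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS release "
            "(ident TEXT PRIMARY KEY, doc TEXT NOT NULL)"
        )

    def put_many(self, docs):
        """Insert or overwrite release dicts in a single transaction."""
        with self.conn:
            self.conn.executemany(
                "INSERT OR REPLACE INTO release (ident, doc) VALUES (?, ?)",
                ((doc["ident"], json.dumps(doc)) for doc in docs),
            )

    def get(self, ident):
        """Return the parsed release document, or None if it is not cached."""
        row = self.conn.execute(
            "SELECT doc FROM release WHERE ident = ?", (ident,)
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Batching the inserts into one transaction keeps cache generation fast. Reducing the number of fields per document, as suggested in the task list, would shrink the `doc` column accordingly.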