aboutsummaryrefslogtreecommitdiffstats
path: root/notes/clustering.md
diff options
context:
space:
mode:
Diffstat (limited to 'notes/clustering.md')
-rw-r--r--notes/clustering.md102
1 files changed, 102 insertions, 0 deletions
diff --git a/notes/clustering.md b/notes/clustering.md
new file mode 100644
index 0000000..3f6312c
--- /dev/null
+++ b/notes/clustering.md
@@ -0,0 +1,102 @@
+# Clustering
+
+Original dataset:
+
+```
+$ sha1sum release_export_expanded.json.zst
+fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
+
+$ zstdcat -T0 release_export_expanded.json.zst | wc -l
+```
+
+Various clusters (title, title normalized, title nysiis (New York State
+Identification and Intelligence System, ...):
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
+```
+
+Parallel (TODO: use `--pipepart`):
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | \
+ parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
+ fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
+```
+
+Numbers of clusters:
+
+```
+ 141022216 cluster_title.json
+ 134709771 cluster_title_normalized.json
+ 119829458 cluster_title_nysiis.json
+```
+
+The number of duplicate record goes up as number of clusters go down:
+
+```
+ 2858088 cluster_title_dups.json
+ 5818143 cluster_title_normalized_dups.json
+ 6274940 cluster_title_nysiis_dups.json
+```
+
+# Cluster numbers
+
+Using normalized title as example:
+
+* 4306860 have cluster size 2, 1511283 have cluster size 3 or larger
+
+```
+ size len
+count 5818143.000 5818143.000
+mean 4.350 52.120
+std 196.347 35.026
+min 2.000 0.000
+25% 2.000 24.000
+50% 2.000 46.000
+75% 3.000 72.000
+max 151383.000 11686.000
+```
+
+Around 448170 clusters with size 5 or more (with some example titles):
+
+```
+Medical Notes
+日本鉄鋼協会第97回講演大会講演概要
+Boutades
+Allergic Contact Dermatitis
+Comité international
+Incontinence
+Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
+Early Intervention
+CURRENT READINGS IN NUCLEAR MEDICINE
+Nannocystis exedens
+```
+
+Grouping. API, hide.
+
+* gnu parallel; top, htop; how much; "chunks"; read one line; "pipeart";
+ batching; "read from a file"; scan a file; "chunking"
+
+# TODO
+
+* [ ] do a SS like clustering, using title and author ngrams
+* [ ] cluster by doi without "vX" suffix
+
+# Verification
+
+* we only need to look at identified duplicates, which will be a few millions
+* we want fast access to all release JSON blob via ident, maybe do a
+ "fuzzycat-cache" that copies relevant files into the fs, e.g.
+"~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json or via microblob (but http we do not need), or sqlite3 (https://www.sqlite.org/fasterthanfs.html)
+
+For verification we need to have the cached json blobs in some fast,
+thread-safe store. Estimated: 1K/s accesses, we still would need a few hours
+for a run.
+
+* [ ] find all ids we need, generate cache, maybe reduce number of fields
+* [ ] run verification on each cluster; generate a file of same format of
+ "verified" clusters; take note the clustering and verification method
+
+Overall, we can combine various clustering and verification methods. We can
+also put together a list of maybe 100-200 test cases and evaluate methods.