update notes on cluster, nb

author: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 20:15:46 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 20:15:46 +0200
commit: 2b216f17fccf6ff90b41ca972bf1730078cc6180 (patch)
tree: 4cf53ef1d9cec359e81251eebbd6aff2ad04b4b5 /notes
parent: 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c (diff)
download: fuzzycat-2b216f17fccf6ff90b41ca972bf1730078cc6180.tar.gz
fuzzycat-2b216f17fccf6ff90b41ca972bf1730078cc6180.zip
1 files changed, 47 insertions, 1 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index d794bdc..95baea3 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -16,7 +16,7 @@ Identification and Intelligence System, ...):
 $ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
 ```
 
-Parallel:
+Parallel (use `--pipepart`):
 
 ```
 $ zstdcat -T0 release_export_expanded.json.zst | \
@@ -32,6 +32,52 @@ Numbers of clusters:
   119829458 cluster_title_nysiis.json
 ```
 
+The number of duplicate records goes up as the number of clusters goes down:
+
+```
+   2858088 cluster_title_dups.json
+   5818143 cluster_title_normalized_dups.json
+   6274940 cluster_title_nysiis_dups.json
+```
+
+# Cluster numbers
+
+Using normalized title as example:
+
+* 4306860 have cluster size 2, 1511283 have cluster size 3 or larger
+
+```
+             size         len
+count 5818143.000 5818143.000
+mean        4.350      52.120
+std       196.347      35.026
+min         2.000       0.000
+25%         2.000      24.000
+50%         2.000      46.000
+75%         3.000      72.000
+max    151383.000   11686.000
+```
+
+Around 448170 clusters with size 5 or more (with some example titles):
+
+```
+Medical Notes
+日本鉄鋼協会第97回講演大会講演概要
+Boutades
+Allergic Contact Dermatitis
+Comité international
+Incontinence
+Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
+Early Intervention
+CURRENT READINGS IN NUCLEAR MEDICINE
+Nannocystis exedens
+```
+
+Grouping. API, hide.
+
+* gnu parallel; top, htop; how much; "chunks"; read one line; "pipepart";
+  batching; "read from a file"; scan a file; "chunking"
+
 # TODO
 
 * [ ] do a SS like clustering, using title and author ngrams
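
The cluster size numbers added in this patch (count, mean, min, max, and the split between size 2 and size 3 or larger) could be recomputed along the following lines. This is a minimal sketch, not the code used to produce the notes: it assumes the cluster file holds one JSON document per line and that the cluster members sit in a list under the key `v`. The key name, the helper names `cluster_sizes` and `summarize`, and the toy input are all assumptions made for illustration.

```python
import io
import json
import statistics
from collections import Counter


def cluster_sizes(lines, values_key="v"):
    """Yield the member count of each cluster, reading one JSON document per line.

    The key name "v" is an assumption about the cluster file layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        yield len(doc[values_key])


def summarize(sizes):
    """Return basic statistics over an iterable of cluster sizes."""
    sizes = sorted(sizes)
    if not sizes:
        return {"count": 0}
    hist = Counter(sizes)
    return {
        "count": len(sizes),
        "mean": sum(sizes) / len(sizes),
        "min": sizes[0],
        "median": statistics.median(sizes),
        "max": sizes[-1],
        "size_2": hist[2],
        "size_3_or_larger": sum(n for s, n in hist.items() if s >= 3),
    }


# Toy input standing in for a real cluster file (made-up records).
sample = io.StringIO(
    '{"k": "a", "v": [{"id": 1}, {"id": 2}]}\n'
    '{"k": "b", "v": [{"id": 3}, {"id": 4}, {"id": 5}]}\n'
    '{"k": "c", "v": [{"id": 6}, {"id": 7}, {"id": 8}, {"id": 9}, {"id": 10}]}\n'
)
stats = summarize(cluster_sizes(sample))
print(stats)
```

For a real file, an open file handle can be passed in place of `sample`. The sizes are collected into a sorted list, so memory use grows with the number of clusters; for several million clusters that is a list of a few million integers.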