author Martin Czygan <martin.czygan@gmail.com> 2020-10-22 20:15:46 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-10-22 20:15:46 +0200
commit 2b216f17fccf6ff90b41ca972bf1730078cc6180
tree 4cf53ef1d9cec359e81251eebbd6aff2ad04b4b5 /notes
parent 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c
update notes on cluster, nb
Diffstat (limited to 'notes')
-rw-r--r-- notes/Clustering.md 48
1 files changed, 47 insertions, 1 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index d794bdc..95baea3 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -16,7 +16,7 @@ Identification and Intelligence System, ...):
$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
```
-Parallel:
+Parallel (use `--pipepart`):
```
$ zstdcat -T0 release_export_expanded.json.zst | \
@@ -32,6 +32,52 @@ Numbers of clusters:
119829458 cluster_title_nysiis.json
```
+The number of duplicate records goes up as the number of clusters goes down:
+
+```
+ 2858088 cluster_title_dups.json
+ 5818143 cluster_title_normalized_dups.json
+ 6274940 cluster_title_nysiis_dups.json
+```
+
+# Cluster numbers
+
+Using the normalized title as an example:
+
+* 4306860 clusters have size 2, 1511283 have size 3 or larger
+
+```
+               size          len
+count   5818143.000  5818143.000
+mean          4.350       52.120
+std         196.347       35.026
+min           2.000        0.000
+25%           2.000       24.000
+50%           2.000       46.000
+75%           3.000       72.000
+max      151383.000    11686.000
+```
+
+Around 448170 clusters with size 5 or more (with some example titles):
+
+```
+Medical Notes
+日本鉄鋼協会第97回講演大会講演概要
+Boutades
+Allergic Contact Dermatitis
+Comité international
+Incontinence
+Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
+Early Intervention
+CURRENT READINGS IN NUCLEAR MEDICINE
+Nannocystis exedens
+```
+
+Grouping. API, hide.
+
+* GNU parallel; top, htop; how much; "chunks"; read one line; "pipepart";
+ batching; "read from a file"; scan a file; "chunking"
+
# TODO
* [ ] do a SS like clustering, using title and author ngrams
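
The cluster size figures in the notes above (4306860 clusters of size 2 plus 1511283 of size 3 or larger, which add up to the reported count of 5818143) could be reproduced along the following lines. This is only a sketch under stated assumptions: it assumes each line of a cluster file is a JSON object with the cluster key under `k` and the list of member documents under `v`, and that the `len` column refers to the length of the key. Neither is confirmed by the notes, and the function names and sample lines are made up for illustration.

```python
import json
from collections import Counter


def cluster_sizes(lines, key_field="k", values_field="v"):
    """Yield (cluster size, key length) for each non-empty JSON line.

    The field names "k" and "v" are assumptions about the file layout.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        yield len(doc[values_field]), len(doc[key_field])


def summarize(lines, min_size=2):
    """Count clusters by size, ignoring clusters smaller than min_size."""
    hist = Counter(size for size, _ in cluster_sizes(lines) if size >= min_size)
    return {
        "count": sum(hist.values()),
        "size_2": hist[2],
        "size_3_or_larger": sum(n for size, n in hist.items() if size >= 3),
        "size_5_or_larger": sum(n for size, n in hist.items() if size >= 5),
        "max_size": max(hist, default=0),
    }


# Made-up sample lines, for illustration only.
sample = [
    '{"k": "medical notes", "v": [{}, {}]}',
    '{"k": "boutades", "v": [{}, {}, {}]}',
    '{"k": "incontinence", "v": [{}, {}, {}, {}, {}]}',
    '{"k": "unique title", "v": [{}]}',
]
print(summarize(sample))
```

Since `summarize` accepts any iterable of lines, an open file handle can be passed directly, so a large cluster file is read one line at a time rather than loaded into memory.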