author | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 20:15:46 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 20:15:46 +0200 |
commit | 2b216f17fccf6ff90b41ca972bf1730078cc6180 (patch) | |
tree | 4cf53ef1d9cec359e81251eebbd6aff2ad04b4b5 /notes | |
parent | 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c (diff) | |
update notes on cluster, nb
Diffstat (limited to 'notes')
-rw-r--r-- | notes/Clustering.md | 48 |
1 files changed, 47 insertions, 1 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index d794bdc..95baea3 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -16,7 +16,7 @@ Identification and Intelligence System, ...):
 $ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
 ```

-Parallel:
+Parallel (use `--pipepart`):

 ```
 $ zstdcat -T0 release_export_expanded.json.zst | \
@@ -32,6 +32,52 @@ Numbers of clusters:
 119829458 cluster_title_nysiis.json
 ```

+The number of duplicate record goes up as number of clusters go down:
+
+```
+ 2858088 cluster_title_dups.json
+ 5818143 cluster_title_normalized_dups.json
+ 6274940 cluster_title_nysiis_dups.json
+```
+
+# Cluster numbers
+
+Using normalized title as example:
+
+* 4306860 have cluster size 2, 1511283 have cluster size 3 or larger
+
+```
+              size          len
+count  5818143.000  5818143.000
+mean         4.350       52.120
+std        196.347       35.026
+min          2.000        0.000
+25%          2.000       24.000
+50%          2.000       46.000
+75%          3.000       72.000
+max     151383.000    11686.000
+```
+
+Around 448170 clusters with size 5 or more (with some example titles):
+
+```
+Medical Notes
+日本鉄鋼協会第97回講演大会講演概要
+Boutades
+Allergic Contact Dermatitis
+Comité international
+Incontinence
+Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
+Early Intervention
+CURRENT READINGS IN NUCLEAR MEDICINE
+Nannocystis exedens
+```
+
+Grouping. API, hide.
+
+* gnu parallel; top, htop; how much; "chunks"; read one line; "pipeart";
+  batching; "read from a file"; scan a file; "chunking"
+
 # TODO

 * [ ] do a SS like clustering, using title and author ngrams
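
The cluster-size figures added in this commit (clusters of size 2, of size 3 or larger, of size 5 or more) could be tallied roughly as in the sketch below. The notes do not show the file layout, so this assumes one JSON document per line with the cluster members in a list under a key `v`. That key name, and the function names `cluster_size_counts` and `summarize`, are hypothetical and not taken from fuzzycat.

```python
import json
from collections import Counter


def cluster_size_counts(lines, members_key="v"):
    """Tally clusters by size from an iterable of JSON lines.

    Assumption: each line is a JSON object whose cluster members are a
    list under `members_key` (hypothetical layout). Clusters with fewer
    than two members are skipped, since they contain no duplicates.
    """
    sizes = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        size = len(doc.get(members_key, []))
        if size >= 2:
            sizes[size] += 1
    return sizes


def summarize(sizes):
    """Reduce a size -> count mapping to the buckets used in the notes."""
    return {
        "clusters": sum(sizes.values()),
        "size_2": sizes.get(2, 0),
        "size_3_or_larger": sum(n for s, n in sizes.items() if s >= 3),
        "size_5_or_larger": sum(n for s, n in sizes.items() if s >= 5),
    }
```

Because the function only iterates over lines, an open file handle can be passed directly, for example `summarize(cluster_size_counts(open("cluster_title_normalized_dups.json")))`, which keeps memory use flat on files with millions of lines.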