From 621f50e685d9beeb1fe502a133e76fbd5a8a9c5c Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Tue, 24 Nov 2020 15:06:34 +0100
Subject: cleanup

---
 notes/clustering.md | 102 --------------------------------------------------
 1 file changed, 102 deletions(-)
 delete mode 100644 notes/clustering.md

diff --git a/notes/clustering.md b/notes/clustering.md
deleted file mode 100644
index 3f6312c..0000000
--- a/notes/clustering.md
+++ /dev/null
@@ -1,102 +0,0 @@
-# Clustering
-
-Original dataset:
-
-```
-$ sha1sum release_export_expanded.json.zst
-fa7ce335e27bbf6ccee227992ecd9b860e8e36af  release_export_expanded.json.zst
-
-$ zstdcat -T0 release_export_expanded.json.zst | wc -l
-```
-
-Various clusterings: by title, by normalized title, by NYSIIS (New York State
-Identification and Intelligence System) key of the title, and so on:
-
-```
-$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
-```
-
-Parallel (TODO: use `--pipepart`; a sketch is appended below):
-
-```
-$ zstdcat -T0 release_export_expanded.json.zst | \
-    parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
-    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
-```
-
-Number of clusters:
-
-```
- 141022216 cluster_title.json
- 134709771 cluster_title_normalized.json
- 119829458 cluster_title_nysiis.json
-```
-
-The number of duplicate records goes up as the number of clusters goes down:
-
-```
-   2858088 cluster_title_dups.json
-   5818143 cluster_title_normalized_dups.json
-   6274940 cluster_title_nysiis_dups.json
-```
-
-# Cluster numbers
-
-Using the normalized title as an example:
-
-* 4306860 clusters have size 2, 1511283 have size 3 or larger
-
-```
-              size          len
-count  5818143.000  5818143.000
-mean         4.350       52.120
-std        196.347       35.026
-min          2.000        0.000
-25%          2.000       24.000
-50%          2.000       46.000
-75%          3.000       72.000
-max     151383.000    11686.000
-```
-
-Around 448170 clusters have size 5 or more; some example titles:
-
-```
-Medical Notes
-日本鉄鋼協会第97回講演大会講演概要
-Boutades
-Allergic Contact Dermatitis
-Comité international
-Incontinence
-Efficient Uncertainty Minimization for Fuzzy Spectral Clustering
-Early Intervention
-CURRENT READINGS IN NUCLEAR MEDICINE
-Nannocystis exedens
-```
-
-Grouping. API, hide.
-
-* gnu parallel; top, htop; how much; "chunks"; read one line; "pipepart";
-  batching; "read from a file"; scan a file; "chunking"
-
-# TODO
-
-* [ ] do an SS-style clustering, using title and author ngrams
-* [ ] cluster by DOI without the "vX" suffix (sketch appended below)
-
-# Verification
-
-* we only need to look at identified duplicates, which will be a few million
-* we want fast access to every release JSON blob via ident; maybe do a
-  "fuzzycat-cache" that copies the relevant files into the filesystem, e.g.
-  "~/.cache/fuzzycat/releases/d9/e4d4be49faafc750563351a126e7bafe29.json", or
-  use microblob (though we do not need HTTP), or sqlite3
-  (https://www.sqlite.org/fasterthanfs.html)
-
-For verification we need the cached JSON blobs in some fast, thread-safe
-store. Estimated at 1K accesses/s, a full run would still take a few hours.
-
-* [ ] find all ids we need, generate the cache, maybe reduce the number of
-  fields (sketch appended below)
-* [ ] run verification on each cluster; generate a file of the same format
-  containing the "verified" clusters; note the clustering and verification
-  method used
-
-Overall, we can combine various clustering and verification methods. We can
-also put together a list of maybe 100-200 test cases and evaluate the methods.
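
A possible shape for the `--pipepart` TODO in the parallel run above — only a
sketch: `--pipepart` needs a seekable file (`-a`), so the export is decompressed
to disk first, and the `--block` size here is just a guess:

```
$ zstdcat -T0 release_export_expanded.json.zst > /bigger/tmp/release_export_expanded.json
$ parallel --pipepart -a /bigger/tmp/release_export_expanded.json \
    --block 256M --roundrobin -j 16 --tmpdir /bigger/tmp \
    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
```

Each worker then reads its own slice of the file directly, instead of a single
`zstdcat | parallel --pipe` process feeding all sixteen jobs, at the cost of
materializing the decompressed export on disk.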
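
The size/len summary above reads like a pandas `describe()`. The size
distribution can also be recomputed from the shell — this assumes each line of
the `*_dups.json` files carries a numeric `size` field, which is a guess about
the fuzzycat-cluster output format:

```
$ jq '.size' cluster_title_normalized_dups.json | sort -n | uniq -c
```

The first line should reproduce the 4306860 clusters of size 2; summing the
remaining lines gives the 1511283 clusters of size 3 or larger.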
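
For the "cluster by DOI without the vX suffix" TODO, a rough first pass —
assuming the export keeps the DOI at `.ext_ids.doi` (a guess at the field path)
and that versioned DOIs end in `v2`, `.v3`, and so on:

```
$ zstdcat -T0 release_export_expanded.json.zst \
    | jq -r '.ext_ids.doi // empty' \
    | sed -E 's/[.]?v[0-9]+$//' \
    | sort -S 40% | uniq -cd | sort -rn > doi_no_version_dups.txt
```

`uniq -cd` keeps only DOIs that occur more than once after the suffix is
stripped, i.e. candidate clusters of versions of the same record.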
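
For "find all ids we need, generate the cache, maybe reduce the number of
fields": one option is to cut the export down to the fields verification is
likely to touch before loading it into sqlite3 or microblob. The field names
below (`ident`, `title`, `contribs`, `ext_ids`) are assumptions about the
release export schema:

```
$ zstdcat -T0 release_export_expanded.json.zst \
    | jq -c '{ident, title, contribs, ext_ids}' \
    | zstd -T0 -c > release_cache_reduced.json.zst
```

That leaves a key (ident) plus whatever the verifier needs, which should also
help if the cache ends up as a single sqlite3 table keyed by ident.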