Diffstat (limited to 'notes')
-rw-r--r-- | notes/Clustering.md | 27 |
1 files changed, 27 insertions, 0 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index 754852d..d390035 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -4,8 +4,35 @@ Original dataset:
 ```
 $ sha1sum release_export_expanded.json.zst
+fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
 $ zstdcat -T0 release_export_expanded.json.zst | wc -l
 ```
+Various clusters (title, title normalized, title nysiis (New York State
+Identification and Intelligence System, ...):
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
+```
+
+Parallel:
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | \
+ parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
+ fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
+```
+
+Numbers of clusters:
+
+```
+ 141022216 cluster_title.json
+ 134709771 cluster_title_normalized.json
+ 119829458 cluster_title_nysiis.json
+```
+
+# TODO
+
+* [ ] do a SS like clustering, using title and author ngrams
+* [ ] cluster by doi without "vX" suffix
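
The last TODO item (cluster by doi without "vX" suffix) could be sketched roughly as follows. This is a minimal illustration, not how `fuzzycat-cluster` works. It assumes each input line is a JSON document carrying an `ident` and an `ext_ids.doi` field, and that a version suffix is a trailing `v<digits>` that follows either a digit or a dot. The names `doi_key` and `cluster_by_doi` are hypothetical.

```python
import collections
import json
import re

# Assumption: a version suffix is a trailing "v<digits>" that comes directly
# after a digit ("...123456v2") or after a dot ("...123.v2"). Suffixes that
# follow a letter ("rev2") are left alone.
VERSION_SUFFIX = re.compile(r"(?:(?<=\d)|\.)v\d+$")


def doi_key(doi):
    """Return a lowercased DOI with any trailing version suffix removed."""
    return VERSION_SUFFIX.sub("", doi.strip().lower())


def cluster_by_doi(lines):
    """Group release idents by versionless DOI.

    `lines` is any iterable of JSON strings, one document per line.
    The field names "ident" and "ext_ids" are assumptions about the export.
    """
    clusters = collections.defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        doi = (doc.get("ext_ids") or {}).get("doi")
        if not doi:
            # Documents without a DOI cannot be clustered by this key.
            continue
        clusters[doi_key(doi)].append(doc.get("ident"))
    return dict(clusters)
```

Passing `sys.stdin` to `cluster_by_doi` would let it sit behind `zstdcat` in the same way as the commands above. It holds every key in memory, so at the scale of this dataset it would likely need an external sort instead.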