# Clustering

Original dataset:

```
$ sha1sum release_export_expanded.json.zst
fa7ce335e27bbf6ccee227992ecd9b860e8e36af  release_export_expanded.json.zst

$ zstdcat -T0 release_export_expanded.json.zst | wc -l
```

Various clusters (title, title normalized, title NYSIIS (New York State Identification and Intelligence System), ...):

```
$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
```

Parallel:

```
$ zstdcat -T0 release_export_expanded.json.zst | \
    parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
```

Number of clusters:

```
141022216 cluster_title.json
134709771 cluster_title_normalized.json
119829458 cluster_title_nysiis.json
```

# TODO

* [ ] do a SS like clustering, using title and author ngrams
* [ ] cluster by doi without "vX" suffix
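
A rough sketch of the "doi without vX suffix" idea, not part of fuzzycat itself. It assumes each input line is a JSON release with an `ident` field and a DOI under `ext_ids.doi`, and that a version suffix is a trailing `v<digits>`, optionally preceded by `.` or `/`. The function names and the regular expression are hypothetical and would need checking against real DOIs (a DOI that legitimately ends in something like `rev12` would be truncated by this pattern).

```python
import collections
import json
import re

# Assumption: a version suffix is a trailing "v<digits>",
# optionally preceded by "." or "/" (e.g. "...v2", ".../v1", "....v3").
VERSION_SUFFIX = re.compile(r"[./]?v\d+$", re.IGNORECASE)


def doi_without_version(doi):
    """Lowercase the DOI and strip a trailing version suffix, if any."""
    return VERSION_SUFFIX.sub("", doi.strip().lower())


def cluster_by_doi(lines):
    """Group release idents by their version-less DOI.

    Releases without a DOI are skipped.
    """
    clusters = collections.defaultdict(list)
    for line in lines:
        doc = json.loads(line)
        # Assumed layout: {"ident": "...", "ext_ids": {"doi": "..."}}
        doi = (doc.get("ext_ids") or {}).get("doi")
        if not doi:
            continue
        clusters[doi_without_version(doi)].append(doc["ident"])
    return dict(clusters)
```

Passing `sys.stdin` as `lines` would let this sit at the end of the same `zstdcat` pipeline; holding all clusters in memory is only practical for samples, not for the full export.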