author | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 10:52:30 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 10:52:30 +0200 |
commit | f01429bbc70cf8ed8cbc114956cd37236e65fd4a | |
tree | 2d9bd5a37dc78726b8a15fb931a55b67af73452d /notes | |
parent | 3e5aa503d69f6090698d55e1f03648b4628be069 | |
update cluster notes
Diffstat (limited to 'notes')
-rw-r--r-- | notes/Clustering.md | 27 |
1 files changed, 27 insertions, 0 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index 754852d..d390035 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -4,8 +4,35 @@ Original dataset:
 ```
 $ sha1sum release_export_expanded.json.zst
+fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
 $ zstdcat -T0 release_export_expanded.json.zst | wc -l
 ```
+Various clusters (title, title normalized, title nysiis (New York State
+Identification and Intelligence System, ...):
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
+```
+
+Parallel:
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | \
+    parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
+    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
+```
+
+Numbers of clusters:
+
+```
+ 141022216 cluster_title.json
+ 134709771 cluster_title_normalized.json
+ 119829458 cluster_title_nysiis.json
+```
+
+# TODO
+
+* [ ] do a SS like clustering, using title and author ngrams
+* [ ] cluster by doi without "vX" suffix
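
For readers of these notes, the idea of clustering releases by a key derived from each record can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the `fuzzycat-cluster` implementation: the field names `ident` and `title`, the normalization rule, and all function names are hypothetical, and the reading of the "vX" TODO item as a trailing version marker such as `v2` or `.v1` is a guess. The sketch also groups in memory, which would likely not scale to an input of the size reported above.

```python
import collections
import json
import re


def title_key(doc):
    """Key on the exact title; None if the record has no title."""
    return doc.get("title") or None


def normalized_title_key(doc):
    """Key on a lowercased title with non-alphanumerics removed (assumed rule)."""
    title = doc.get("title")
    if not title:
        return None
    return re.sub(r"[^a-z0-9]+", "", title.lower()) or None


def strip_doi_version(doi):
    """Drop a trailing 'vN' or '.vN' that directly follows a digit.

    Assumption: this is what the "vX" suffix in the TODO refers to.
    """
    return re.sub(r"(?<=\d)\.?v\d+$", "", doi, flags=re.IGNORECASE)


def cluster(lines, key_func):
    """Group record identifiers by key, reading one JSON document per line."""
    groups = collections.defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        key = key_func(doc)
        if key is None:
            continue  # records without a usable key are skipped
        groups[key].append(doc.get("ident"))
    return dict(groups)


sample = [
    '{"ident": "a", "title": "Deep Learning"}',
    '{"ident": "b", "title": "deep learning."}',
    '{"ident": "c", "title": "Other"}',
    '{"ident": "d"}',
]
by_title = cluster(sample, title_key)
by_normalized = cluster(sample, normalized_title_key)
```

With the sample records, exact titles yield three clusters while the normalized key merges the first two records into one, which mirrors the direction of the counts in the notes (fewer clusters as the key becomes coarser).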