From f01429bbc70cf8ed8cbc114956cd37236e65fd4a Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Thu, 22 Oct 2020 10:52:30 +0200
Subject: update cluster notes

---
 notes/Clustering.md | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

(limited to 'notes')

diff --git a/notes/Clustering.md b/notes/Clustering.md
index 754852d..d390035 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -4,8 +4,35 @@ Original dataset:

 ```
 $ sha1sum release_export_expanded.json.zst
+fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
 $ zstdcat -T0 release_export_expanded.json.zst | wc -l
 ```

+Various clusters (title, title normalized, title nysiis (New York State
+Identification and Intelligence System, ...):
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
+```
+
+Parallel:
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | \
+    parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
+    fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
+```
+
+Numbers of clusters:
+
+```
+ 141022216 cluster_title.json
+ 134709771 cluster_title_normalized.json
+ 119829458 cluster_title_nysiis.json
+```
+
+# TODO
+
+* [ ] do a SS like clustering, using title and author ngrams
+* [ ] cluster by doi without "vX" suffix
-- 
cgit v1.2.3
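
Note on the cluster counts: they drop as the key gets coarser (exact title, then normalized title, then NYSIIS), because more records collapse onto the same key. The following is a minimal sketch of that effect, not fuzzycat's implementation. It assumes each input line is a JSON object with `title` and `ident` fields, and `normalize_title` is a hypothetical normalization rule chosen only for illustration.

```python
import json
import re
from collections import defaultdict


def normalize_title(title):
    """Hypothetical rule: lowercase, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", title.lower())


def cluster_by_key(lines, key_func):
    """Group JSON-lines records whose titles map to the same key."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        title = doc.get("title")
        if not title:
            continue  # skip records without a usable title
        # "ident" is an assumed field name for the record identifier
        clusters[key_func(title)].append(doc.get("ident"))
    return dict(clusters)


# Tiny made-up sample, not taken from the real export.
sample = [
    '{"ident": "a", "title": "Deep Learning"}',
    '{"ident": "b", "title": "Deep learning."}',
    '{"ident": "c", "title": "Shallow Learning"}',
]

exact = cluster_by_key(sample, lambda t: t)
normalized = cluster_by_key(sample, normalize_title)
print(len(exact), len(normalized))
```

On this sample the exact key keeps all three records apart, while the normalized key merges the first two, so the cluster count falls from 3 to 2.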