author Martin Czygan <martin.czygan@gmail.com> 2020-10-22 10:52:30 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-10-22 10:52:30 +0200
commit f01429bbc70cf8ed8cbc114956cd37236e65fd4a
tree 2d9bd5a37dc78726b8a15fb931a55b67af73452d
parent 3e5aa503d69f6090698d55e1f03648b4628be069
update cluster notes
Diffstat (limited to 'notes')
-rw-r--r-- notes/Clustering.md 27
1 files changed, 27 insertions, 0 deletions
diff --git a/notes/Clustering.md b/notes/Clustering.md
index 754852d..d390035 100644
--- a/notes/Clustering.md
+++ b/notes/Clustering.md
@@ -4,8 +4,35 @@ Original dataset:
```
$ sha1sum release_export_expanded.json.zst
+fa7ce335e27bbf6ccee227992ecd9b860e8e36af release_export_expanded.json.zst
$ zstdcat -T0 release_export_expanded.json.zst | wc -l
```
+Various clusters (title, title normalized, title NYSIIS (New York State
+Identification and Intelligence System), ...):
+```
+$ zstdcat -T0 release_export_expanded.json.zst | fuzzycat-cluster -t title > cluster_title.json
+```
+
+Parallel:
+
+```
+$ zstdcat -T0 release_export_expanded.json.zst | \
+ parallel --tmpdir /bigger/tmp --roundrobin --pipe -j 16 \
+ fuzzycat-cluster --tmpdir /bigger/tmp -t title > cluster_title.json
+```
+
+Numbers of clusters:
+
+```
+ 141022216 cluster_title.json
+ 134709771 cluster_title_normalized.json
+ 119829458 cluster_title_nysiis.json
+```
+
+# TODO
+
+* [ ] do a SS like clustering, using title and author ngrams
+* [ ] cluster by doi without "vX" suffix
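
The second TODO item could be approached roughly as follows. This is only a sketch and not part of fuzzycat: `doi_key` and `cluster_by_doi` are hypothetical helper names, the example DOIs are made up, and the suffix pattern assumes that a version marker is `v` plus digits at the very end of the DOI, preceded by a dot, a slash, or a digit.

```python
import re
from collections import defaultdict

# Assumption: a version suffix is "v<digits>" at the end of the DOI, preceded
# by ".", "/" or a digit (e.g. "...123.v2" or "...123456v1"). A trailing
# "v2" after a letter (e.g. "rev2") is left alone to limit false positives.
VERSION_SUFFIX = re.compile(r"(?:(?<=\d)|[./])v\d+$")


def doi_key(doi):
    """Return a lowercased DOI with an assumed trailing version suffix removed."""
    return VERSION_SUFFIX.sub("", doi.strip().lower())


def cluster_by_doi(dois):
    """Group DOIs that share the same version-less key."""
    clusters = defaultdict(list)
    for doi in dois:
        clusters[doi_key(doi)].append(doi)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up DOIs, for illustration only.
    sample = [
        "10.9999/example.123.v1",
        "10.9999/example.123.v2",
        "10.9999/other456v1",
        "10.9999/rev2",
    ]
    for key, members in cluster_by_doi(sample).items():
        print(key, members)
```

Whether this pattern matches the version conventions actually present in the release export would need to be checked against the data before relying on it.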