author | Martin Czygan <martin.czygan@gmail.com> | 2020-11-27 22:03:53 +0100 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-11-27 22:03:53 +0100 |
commit | ccc4e5ecafce186e20dece55d31b31e198201438 (patch) | |
tree | cebc3268fbb0910a2346c350cdd3b52c16e76628 /notes | |
parent | ee00eb1452e918fce4528c7cd9a6c1e51dbb90dc (diff) | |
subtitle: default to list
Diffstat (limited to 'notes')
-rw-r--r-- | notes/2020_11_testruns.md | 31 |
1 files changed, 31 insertions, 0 deletions
diff --git a/notes/2020_11_testruns.md b/notes/2020_11_testruns.md
new file mode 100644
index 0000000..31c292c
--- /dev/null
+++ b/notes/2020_11_testruns.md
@@ -0,0 +1,31 @@
+# Test runs
+
+## Using --min-cluster-size
+
+Skipping writes of single element clusters cuts clustering from ~42h to ~22h.
+
+```
+$ time zstdcat -T0 release_export_expanded.json.zst | \
+    TMPDIR=/bigger/tmp python -m fuzzycat cluster --min-cluster-size 2 \
+    --tmpdir /bigger/tmp -t tsandcrawler | \
+    zstd -c9 > cluster_tsandcrawler_min_cluster_size_2.json.zst
+...
+max cluster size cut off for: 雜報その1
+max cluster size cut off for: 雜録
+2020-11-27 18:31:39.825 DEBUG __main__ - run_cluster: {"key_fail": 0, "key_ok":
+154202433, "key_empty": 942, "key_denylist": 0, "num_clusters": 11763096}
+
+real	1328m46.994s
+user	1088m6.837s
+sys	98m17.501s
+```
+
+We find 11763096 clusters, 16GB compressed (zstdcat takes about 5min,
+sequential read at 50M/s).
+
+```
+$ time zstdcat -T0 cluster_tsandcrawler_min_cluster_size_2.json.zst | \
+    python -m fuzzycat verify | \
+    zstd -T0 -c9 > cluster_tsandcrawler_min_cluster_size_2_verify.tsv.zst
+```
+
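The two rough figures in the notes (~22h wall clock, about 5min for a sequential read) can be checked against the logged numbers. The sketch below is not part of fuzzycat: `time_to_hours` is a hypothetical helper, and the read estimate assumes decimal units (16GB = 16000MB, "50M/s" = 50MB/s).

```python
import re


def time_to_hours(value: str) -> float:
    """Convert a bash `time` value such as '1328m46.994s' to hours."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    minutes, seconds = int(match.group(1)), float(match.group(2))
    return (minutes * 60 + seconds) / 3600


# Wall-clock time of the clustering run, taken from the `real` line.
real_hours = time_to_hours("1328m46.994s")

# Assumption: decimal units, so 16 GB = 16000 MB read at 50 MB/s.
read_minutes = 16 * 1000 / 50 / 60

print(f"wall clock: {real_hours:.1f} h")          # wall clock: 22.1 h
print(f"sequential read: {read_minutes:.1f} min")  # sequential read: 5.3 min
```

Both results agree with the rounded values in the notes.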