author | Martin Czygan <martin.czygan@gmail.com> | 2020-12-09 22:59:47 +0100 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-12-09 22:59:47 +0100 |
commit | 9d707a0203ac3aaf17e266a0f5a934b5f9e2dbbf (patch) | |
tree | ec9829adf87bc49de8245207c50342e8434628be | |
parent | bae9820e4203f8ab692a2b1ba4c9aa4207b425c6 (diff) | |
update README
-rw-r--r-- | README.md | 15 |
1 files changed, 14 insertions, 1 deletions

    @@ -65,7 +65,7 @@ Single threaded, 42h.
     ```
     $ time zstdcat -T0 release_export_expanded.json.zst | \
    -    TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler \ |
    +    TMPDIR=/bigger/tmp python -m fuzzycat cluster --tmpdir /bigger/tmp -t tsandcrawler | \
         zstd -c9 > cluster_tsandcrawler.json.zst
     {
       "key_fail": 0,
    @@ -82,6 +82,19 @@ sys 118m38.141s
     So, 29881072 (about 20%) docs in the potentially duplicated set.
    +Verification (about 15h):
    +
    +```
    +$ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \
    +    zstd -c9 > cluster_tsandcrawler_verified_3c7378.tsv.zst
    +
    +...
    +
    +real 927m28.631s
    +user 939m32.761s
    +sys 36m47.602s
    +```
    +
     # Use cases
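
The "about 15h" figure for the verification run can be checked against the `time` output added in this commit. The following is a minimal sketch, not part of fuzzycat; the helper name `time_to_hours` is hypothetical and it assumes durations in the `<minutes>m<seconds>s` form shown in the README.

```python
import re


def time_to_hours(value: str) -> float:
    """Convert a shell `time` duration such as '927m28.631s' to hours.

    Hypothetical helper for illustration; not part of fuzzycat.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    minutes = int(match.group(1))
    seconds = float(match.group(2))
    return (minutes * 60 + seconds) / 3600


# Timings copied from the verification run in the README diff.
timings = {"real": "927m28.631s", "user": "939m32.761s", "sys": "36m47.602s"}
hours = {name: round(time_to_hours(value), 2) for name, value in timings.items()}

for name, value in hours.items():
    print(f"{name}: {value} h")
```

The wall-clock (`real`) value comes out at roughly 15.5 hours, consistent with the "about 15h" note.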