From 622f56b066316b0f16a9cb087040ee7acaaecaeb Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Wed, 3 Feb 2021 21:34:50 +0100
Subject: note on extra tool

---
 README.md | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 37d665c..db4d3ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ items.
 ## Dataset
 
 For development, we worked on a `release_export_expanded.json` dump (113G/700G
-zstd/plain, 154203375 lines) and with the [fatcat
+zstd/plain, 154,203,375 lines) and with the [fatcat
 API](https://api.fatcat.wiki/). The development workflow looked something like
 the following.
 
@@ -61,7 +61,7 @@ Following algorithms are implemented (or planned):
 Example running clustering:
 
 ```
-$ python -m fuzzycat cluster -t tsandcrawler < data/re.json > cluster.json.zst
+$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst
 ```
 
 Clustering works in a three step process:
@@ -70,6 +70,20 @@ Clustering works in a three step process:
 2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
 3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
 
+Note: For long running processes, this all-or-nothing approach is impractical;
+e.g. running clustering on the joint references and fatcat dataset (2B records)
+takes 24h+.
+
+Ideas:
+
+* [ ] make (sorted) key extraction a fast standalone thing
+
+> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv`
+
+Where `data.key` group (id, key, blob) or the like. Make this line speed (maybe
+w/ rust). Need to carry the blob, as we do not want to restrict options.
+
+
 ## Verification
 
 Run verification (pairwise *double-check* of match candidates in a cluster).
--
cgit v1.2.3
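
The patched README section describes clustering as a three step process: key extraction, sorting by key, and grouping by key via `itertools.groupby`. A minimal self-contained sketch of that shape follows; the title-normalization key function is a hypothetical stand-in for fuzzycat's actual key algorithms, and the in-memory `sorted` stands in for GNU sort over a TSV stream:

```python
import itertools
import json
import re

def key(doc):
    # Hypothetical normalization key: lowercased, alphanumeric-only title.
    # fuzzycat's real algorithms (e.g. the sandcrawler-style key) differ.
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())

docs = [
    {"title": "Deep Learning"},
    {"title": "DEEP LEARNING!"},
    {"title": "Something Else"},
]

# Step 1 + 2: attach a key to each record, then sort by key
# (in the real pipeline this is a TSV stream piped through GNU sort).
keyed = sorted(((key(d), d) for d in docs), key=lambda kv: kv[0])

# Step 3: records sharing a key become one cluster candidate.
clusters = [
    [d for _, d in grp]
    for _, grp in itertools.groupby(keyed, key=lambda kv: kv[0])
]

for c in clusters:
    print(json.dumps({"k": key(c[0]), "v": [d["title"] for d in c]}))
```

This also illustrates why the proposed standalone `fuzzycat-key` step would need to carry the blob alongside (id, key): the grouping stage still needs the full record to emit cluster candidates.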