From 622f56b066316b0f16a9cb087040ee7acaaecaeb Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Wed, 3 Feb 2021 21:34:50 +0100
Subject: note on extra tool

---
 README.md | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 37d665c..db4d3ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ items.
 ## Dataset
 
 For development, we worked on a `release_export_expanded.json` dump (113G/700G
-zstd/plain, 154203375 lines) and with the [fatcat
+zstd/plain, 154,203,375 lines) and with the [fatcat
 API](https://api.fatcat.wiki/). The development workflow looked something like
 the following.
 
@@ -61,7 +61,7 @@ Following algorithms are implemented (or planned):
 Example running clustering:
 
 ```
-$ python -m fuzzycat cluster -t tsandcrawler < data/re.json > cluster.json.zst
+$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst
 ```
 
 Clustering works in a three step process:
@@ -70,6 +70,20 @@ Clustering works in a three step process:
 2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
 3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
 
+Note: For long running processes, this all-or-nothing approach is impractical;
+e.g. running clustering on the joint references and fatcat dataset (2B records)
+takes 24h+.
+
+Ideas:
+
+* [ ] make (sorted) key extraction a fast standalone thing
+
+> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv`
+
+Where `data.key` group (id, key, blob) or the like. Make this line speed (maybe
+w/ rust). Need to carry the blob, as we do not want to restrict options.
+
+
 ## Verification
 
 Run verification (pairwise *double-check* of match candidates in a cluster).
--
cgit v1.2.3
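
The patched README section describes clustering as a three step process: key extraction, sorting by key, and grouping by key via `itertools.groupby`. A minimal self-contained sketch of that shape follows; the title-normalization key function is a hypothetical stand-in for fuzzycat's actual key algorithms, and the in-memory `sorted` stands in for GNU sort over a TSV stream:

```python
import itertools
import json
import re

def key(doc):
    # Hypothetical normalization key: lowercased, alphanumeric-only title.
    # fuzzycat's real algorithms (e.g. the sandcrawler-style key) differ.
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())

docs = [
    {"title": "Deep Learning"},
    {"title": "DEEP LEARNING!"},
    {"title": "Something Else"},
]

# Step 1 + 2: attach a key to each record, then sort by key
# (in the real pipeline this is a TSV stream piped through GNU sort).
keyed = sorted(((key(d), d) for d in docs), key=lambda kv: kv[0])

# Step 3: records sharing a key become one cluster candidate.
clusters = [
    [d for _, d in grp]
    for _, grp in itertools.groupby(keyed, key=lambda kv: kv[0])
]

for c in clusters:
    print(json.dumps({"k": key(c[0]), "v": [d["title"] for d in c]}))
```

This also illustrates why the proposed standalone `fuzzycat-key` step would need to carry the blob alongside (id, key): the grouping stage still needs the full record to emit cluster candidates.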