note on extra tool

author: Martin Czygan <martin.czygan@gmail.com> 2021-02-03 21:34:50 +0100
committer: Martin Czygan <martin.czygan@gmail.com> 2021-02-03 21:34:50 +0100
commit: 622f56b066316b0f16a9cb087040ee7acaaecaeb
tree: 99c7df7543044934fea5c6d9af5258e9b428fbd9
parent: cc34369216d747d9d6721475d344cd458891e6a0
1 files changed, 16 insertions, 2 deletions
diff --git a/README.md b/README.md
index 37d665c..db4d3ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ items.
 ## Dataset
 
 For development, we worked on a `release_export_expanded.json` dump (113G/700G
-zstd/plain, 154203375 lines) and with the [fatcat
+zstd/plain, 154,203,375 lines) and with the [fatcat
 API](https://api.fatcat.wiki/).
 
 The development workflow looked something like the following.
@@ -61,7 +61,7 @@ Following algorithms are implemented (or planned):
 Example running clustering:
 
 ```
-$ python -m fuzzycat cluster -t tsandcrawler < data/re.json > cluster.json.zst
+$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst
 ```
 
 Clustering works in a three step process:
@@ -70,6 +70,20 @@ Clustering works in a three step process:
 2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
 3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
 
+Note: For long running processes, this all-or-nothing approach is impractical;
+e.g. running clustering on the joint references and fatcat dataset (2B records)
+takes 24h+.
+
+Ideas:
+
+* [ ] make (sorted) key extraction a fast standalone thing
+
+> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv`
+
+Where `data.key.tsv` holds (id, key, blob) rows or the like. Make this run at line
+speed (maybe w/ rust). Need to carry the blob, as we do not want to restrict options.
+
+
 ## Verification
 
 Run verification (pairwise *double-check* of match candidates in a cluster).
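
The standalone key extraction idea from the added README lines could look roughly like the sketch below. It is not part of this commit and not fuzzycat's actual implementation: `title_key`, `extract_keys` and `group_by_key` are hypothetical names, the key function is a placeholder, and the `ident` and `title` fields are assumptions about the input records. It emits (id, key, blob) rows as TSV and then groups key-sorted rows with `itertools.groupby`.

```python
import itertools
import json
import re


def title_key(doc):
    """Hypothetical key function: lowercased title, letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def extract_keys(lines, key_func=title_key):
    """Yield one 'id<TAB>key<TAB>blob' row per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # json.dumps escapes tabs and newlines inside strings, so the
        # re-serialized blob always fits into a single TSV column.
        blob = json.dumps(doc, separators=(",", ":"))
        # "ident" is an assumed field name for the record id.
        yield "\t".join((str(doc.get("ident", "")), key_func(doc), blob))


def group_by_key(rows):
    """Group rows that are already sorted by key; yield (key, [ids])."""
    split = (row.rstrip("\n").split("\t", 2) for row in rows)
    for key, group in itertools.groupby(split, key=lambda cols: cols[1]):
        yield key, [cols[0] for cols in group]


if __name__ == "__main__":
    sample = [
        '{"ident": "a1", "title": "Fuzzy Matching!"}',
        '{"ident": "c3", "title": "Other"}',
        '{"ident": "b2", "title": "fuzzy  matching"}',
    ]
    # In-memory sort for the demo only; at scale this step would be an
    # external sort on the key column.
    rows = sorted(extract_keys(sample), key=lambda r: r.split("\t", 2)[1])
    for key, ids in group_by_key(rows):
        print(key, ids)
```

Splitting the stages this way means the key file can be written once and re-sorted or re-grouped without repeating the extraction. With a file on disk, the sort step could be done with something like `LC_ALL=C sort -t $'\t' -k2,2 data.key.tsv`.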