author Martin Czygan <martin.czygan@gmail.com> 2021-02-03 21:34:50 +0100
committer Martin Czygan <martin.czygan@gmail.com> 2021-02-03 21:34:50 +0100
commit 622f56b066316b0f16a9cb087040ee7acaaecaeb (patch)
tree 99c7df7543044934fea5c6d9af5258e9b428fbd9
parent cc34369216d747d9d6721475d344cd458891e6a0 (diff)
note on extra tool
-rw-r--r-- README.md 18
1 files changed, 16 insertions, 2 deletions
diff --git a/README.md b/README.md
index 37d665c..db4d3ed 100644
--- a/README.md
+++ b/README.md
@@ -39,7 +39,7 @@ items.
## Dataset
For development, we worked on a `release_export_expanded.json` dump (113G/700G
-zstd/plain, 154203375 lines) and with the [fatcat
+zstd/plain, 154,203,375 lines) and with the [fatcat
API](https://api.fatcat.wiki/).
The development workflow looked something like the following.
@@ -61,7 +61,7 @@ Following algorithms are implemented (or planned):
Example running clustering:
```
-$ python -m fuzzycat cluster -t tsandcrawler < data/re.json > cluster.json.zst
+$ python -m fuzzycat cluster -t tsandcrawler < data/re.json | zstd -c -T0 > cluster.json.zst
```
Clustering works in a three step process:
@@ -70,6 +70,20 @@ Clustering works in a three step process:
2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
+Note: For long running processes, this all-or-nothing approach is impractical;
+e.g. running clustering on the joint references and fatcat dataset (2B records)
+takes 24h+.
+
+Ideas:
+
+* [ ] make (sorted) key extraction a fast standalone thing
+
+> `cat data.jsonl | fuzzycat-key --algo X > data.key.tsv`
+
+Where `data.key` group (id, key, blob) or the like. Make this line speed (maybe
+w/ rust). Need to carry the blob, as we do not want to restrict options.
+
+
## Verification
Run verification (pairwise *double-check* of match candidates in a cluster).
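
Reading note on the clustering hunk above: the three steps (key extraction, sorting by key, grouping by key) and the proposed `(id, key, blob)` intermediate rows can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not fuzzycat's implementation: `title_key`, `extract`, and `cluster` are hypothetical names, the `ident` and `title` fields and the `{"k": ..., "v": ...}` output shape are assumed, and an in-memory `sorted()` stands in for GNU sort.

```python
import itertools
import json
import re


def title_key(doc):
    """Hypothetical key function: lowercase the title, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def extract(lines, key_func=title_key):
    """Step 1: turn each JSON line into an (id, key, blob) row.

    The raw line is carried along as the blob, so later stages are not
    restricted to the fields used for the key.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        # "ident" is an assumed field name for the record identifier.
        yield doc.get("ident", ""), key_func(doc), line


def cluster(lines, key_func=title_key):
    """Steps 2 and 3: sort rows by key, then group rows sharing a key."""
    # In-memory sort as a stand-in for an external sort over a TSV file.
    rows = sorted(extract(lines, key_func), key=lambda row: row[1])
    for key, group in itertools.groupby(rows, key=lambda row: row[1]):
        yield {"k": key, "v": [json.loads(blob) for _, _, blob in group]}
```

Because `extract` is a separate generator, its rows could be written out as tab-separated lines and sorted externally, which is the split into a standalone key-extraction step that the note in the patch suggests. `itertools.groupby` only groups adjacent items, so the sort has to happen before the grouping.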