author | Martin Czygan <martin.czygan@gmail.com> | 2021-09-14 00:44:44 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-09-14 00:44:44 +0200 |
commit | a1c908b83d6e07c4065be52e405977114a9f37c4 (patch) | |
tree | 9044df043c5d6ab86e0b88aafa80cd04b555c915 | |
parent | d3ffa9981f0c7e50cef256a2bfbb7b80caa1eba3 (diff) | |
add todo
-rw-r--r-- | TODO.md | 28 |
1 files changed, 28 insertions, 0 deletions
@@ -0,0 +1,28 @@
+# TODO
+
+* [ ] clustering should be broken up, e.g. into "map" and "sort"
+
+In
+[refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)
+we have one simple operation: extract a list of fields from a blob of bytes. We
+use [16
+mappers](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/skate/cmd/skate-map/main.go#L70-86)
+currently; they are easy to write.
+
+In refcat, we use GNU sort, and only when we need it, e.g.
+[skate-map](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/python/refcat/tasks.py#L531-534).
+
+The `Cluster` class bundles iteration, key extraction, sorting and group-by
+operations into a single entity.
+
+Also in refcat, we do not work on a single file with clusters any more, but
+mostly with two sorted streams, which are iterated over "comm" style. This
+spares us an extra step of generating the cluster documents, but requires an
+extra component that allows plugging in various "reduce" functions. In refcat,
+this component is called "zipkey", which supports batching, too.
+
+* [ ] match release fuzzy should work not just with title
+* [ ] match container name functions (maybe also with abbreviations, etc)
+* [ ] better documentation, more examples
+* [ ] shiv based packaging
+
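
The "comm"-style iteration over two sorted streams described in the TODO could look roughly like the following Python sketch. It is not refcat's "zipkey" implementation; the names `zip_by_key`, `_groups`, `key` and `reduce_fn` are hypothetical, the sample records are made up, and both inputs are assumed to be already sorted by the same key (for example by GNU sort). Batching is left out.

```python
import itertools
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple


def _groups(stream: Iterable[Any], key: Callable[[Any], Any]) -> Iterator[Tuple[Any, List[Any]]]:
    # Materialize each run of equal keys, since itertools.groupby
    # invalidates a group as soon as the iterator advances.
    for k, grp in itertools.groupby(stream, key=key):
        yield k, list(grp)


def zip_by_key(
    a: Iterable[Any],
    b: Iterable[Any],
    key: Callable[[Any], Any],
    reduce_fn: Callable[[List[Any], List[Any]], Any],
) -> Iterator[Any]:
    """Walk two key-sorted streams in lockstep and call reduce_fn
    on every pair of groups that share a key (hypothetical helper)."""
    ga, gb = _groups(a, key), _groups(b, key)
    ca: Optional[Tuple[Any, List[Any]]] = next(ga, None)
    cb: Optional[Tuple[Any, List[Any]]] = next(gb, None)
    while ca is not None and cb is not None:
        if ca[0] < cb[0]:
            ca = next(ga, None)  # key only in a: skip ahead
        elif ca[0] > cb[0]:
            cb = next(gb, None)  # key only in b: skip ahead
        else:
            yield reduce_fn(ca[1], cb[1])  # key in both: reduce
            ca, cb = next(ga, None), next(gb, None)


# Made-up sample data: (key, document id) pairs, sorted by key.
releases = [("a", "r1"), ("b", "r2"), ("d", "r3")]
refs = [("b", "x1"), ("b", "x2"), ("c", "x3"), ("d", "x4")]

# A pluggable "reduce" function: emit all pairs of ids sharing a key.
def cross(xs, ys):
    return [(x[1], y[1]) for x in xs for y in ys]

pairs = list(zip_by_key(releases, refs, key=lambda t: t[0], reduce_fn=cross))
```

Only one group per stream is held in memory at a time, which is why no intermediate cluster document is needed; a different `reduce_fn` can be swapped in without touching the iteration.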