author Martin Czygan <martin@archive.org> 2021-11-16 19:06:26 +0000
committer Martin Czygan <martin@archive.org> 2021-11-16 19:06:26 +0000
commit 24dcddc4e4cff744e7c0a964856329d2ac69601d
tree ad8650892805e55ec4a6958f9e1539eb675332b8 /TODO.md
parent 282f315c6ba3643c8c614220ab2f7e1d55de3658
parent 409392d66c3a6debe5bc69c0e2308209ac74ee35
Merge branch 'martin-matcher-class' into 'master'

turn "match_release_fuzzy" into a class

See merge request webgroup/fuzzycat!10
Diffstat (limited to 'TODO.md')
-rw-r--r-- TODO.md | 34
1 files changed, 19 insertions, 15 deletions
diff --git a/TODO.md b/TODO.md
index 5666bc0..9241b60 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,28 +1,32 @@
# TODO
* [ ] clustering should be broken up, e.g. into "map" and "sort"
+* [ ] match release fuzzy should work not just with title
+* [ ] match container name functions (maybe also with abbreviations, etc)
+* [ ] better documentation, more examples
+* [ ] shiv based packaging
+* [ ] author similarity should be broken up; easier to tweak
+* [ ] split up `verify`
+* [ ] configurable `verify`
+
+Other repos:
-In
-[refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)
-we have one simple operation: extract a list of fields from blob of bytes. We
-use [16
-mappers](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/skate/cmd/skate-map/main.go#L70-86)
-currently, they are easy to write.
+* [refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)
+
+In refcat we have one simple operation: extract a list of fields from a blob of
+bytes. We currently use [16 mappers](https://is.gd/E0NEXj); they are easy to
+write.
In refcat, we use GNU sort, and just when we need it, e.g.
-[skate-map](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/python/refcat/tasks.py#L531-534).
+[skate-map](https://is.gd/Kt9hvL).
The `Cluster` class bundles iteration, key extraction, sorting and group-by
operations into a single entity.
Also in refcat, we do not work on a single file with clusters any more, but
-mostly with two sorted streams, which are iterated over "comm" style. This
-spares us an extra step of generating the cluster documents, but requires an
-extra component, that allows to plug in various "reduce" functions. In refcat,
-this component is called "zipkey", which is support batching, too.
+mostly with two sorted streams, which are iterated over "mergesort/comm" style.
-* [ ] match release fuzzy should work not just with title
-* [ ] match container name functions (maybe also with abbreviations, etc)
-* [ ] better documentation, more examples
-* [ ] shiv based packaging
+This spares us the extra step of generating the cluster documents, but requires
+an extra component that allows plugging in various "reduce" functions. In
+refcat, this component is called "zipkey", which supports batching, too.
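
The "two sorted streams plus a pluggable reduce function" idea from the last paragraph of the patch can be sketched in Python roughly as follows. This is a minimal illustration and not refcat's actual "zipkey" implementation (the linked skate code is Go). The function name `zipkey`, its signature, and the sample data are assumptions made for this example. Batching is left out. Both inputs are assumed to be sorted by the same key, in an order that agrees with Python's `<` comparison.

```python
import itertools

_DONE = object()  # sentinel marking an exhausted stream


def zipkey(left, right, key, reduce):
    """Walk two key-sorted iterables in lockstep ("comm" style).

    For every key that occurs in both streams, call
    reduce(key, left_group, right_group) and yield its result.
    Keys found in only one stream are skipped.
    """
    lgroups = itertools.groupby(left, key=key)
    rgroups = itertools.groupby(right, key=key)
    lcur = next(lgroups, _DONE)
    rcur = next(rgroups, _DONE)
    while lcur is not _DONE and rcur is not _DONE:
        (lkey, lgroup), (rkey, rgroup) = lcur, rcur
        if lkey < rkey:
            lcur = next(lgroups, _DONE)  # left is behind, advance it
        elif lkey > rkey:
            rcur = next(rgroups, _DONE)  # right is behind, advance it
        else:
            # Materialize both groups before advancing, since groupby
            # invalidates a group once the next one is requested.
            yield reduce(lkey, list(lgroup), list(rgroup))
            lcur = next(lgroups, _DONE)
            rcur = next(rgroups, _DONE)


# Hypothetical sample data: (key, id) tuples, already sorted by key.
releases = [("a", "r1"), ("b", "r2"), ("b", "r3"), ("d", "r4")]
refs = [("b", "x1"), ("c", "x2"), ("d", "x3"), ("d", "x4")]


def pairs(_key, lgroup, rgroup):
    """Example reduce function: all (left id, right id) pairs for one key."""
    return [(l[1], r[1]) for l in lgroup for r in rgroup]


result = list(zipkey(releases, refs, key=lambda t: t[0], reduce=pairs))
```

Because only one group per stream is held in memory at a time, no intermediate cluster document has to be written; swapping `pairs` for another function changes what is computed per shared key without touching the iteration logic.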