From 0c84af603894049dd8edd95da18d8990ab0516d1 Mon Sep 17 00:00:00 2001
From: Martin Czygan
Date: Fri, 5 Nov 2021 17:19:07 +0100
Subject: turn "match_release_fuzzy" into a class

Goal of this refactoring was to make the matching process a bit more
configurable by using a class and a cascade of queries. For a limited test
set: `FuzzyReleaseMatcher.match` is works the same as `match_release_fuzzy`.
---
 TODO.md | 34 +++++++++++++++++++---------------
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/TODO.md b/TODO.md
index 5666bc0..9241b60 100644
--- a/TODO.md
+++ b/TODO.md
@@ -1,28 +1,32 @@
 # TODO

 * [ ] clustering should be broken up, e.g. into "map" and "sort"
+* [ ] match release fuzzy should work not just with title
+* [ ] match container name functions (maybe also with abbreviations, etc)
+* [ ] better documentation, more examples
+* [ ] shiv based packaging
+* [ ] author similarity should be broken up; easier to tweak
+* [ ] split up `verify`
+* [ ] configurable `verify`
+
+Other repos:

-In
-[refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)
-we have one simple operation: extract a list of fields from blob of bytes. We
-use [16
-mappers](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/skate/cmd/skate-map/main.go#L70-86)
-currently, they are easy to write.
+* [refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)
+
+In refcat we have one simple operation: extract a list of fields from blob of
+bytes. We use [16 mappers](https://is.gd/E0NEXj) currently, they are easy to
+write.

 In refcat, we use GNU sort, and just when we need it, e.g.
-[skate-map](https://gitlab.com/internetarchive/refcat/-/blob/f33e586d11f5f575f71ad209608ac9ba74fad2e5/python/refcat/tasks.py#L531-534).
+[skate-map](https://is.gd/Kt9hvL).

 The `Cluster` class bundles, iteration, key extraction, sorting and group by
 operation into a single entity.

 Also in refcat, we do not work on a single file with clusters any more, but
-mostly with two sorted streams, which are iterated over "comm" style. This
-spares us an extra step of generating the cluster documents, but requires an
-extra component, that allows to plug in various "reduce" functions. In refcat,
-this component is called "zipkey", which is support batching, too.
+mostly with two sorted streams, which are iterated over "mergesort/comm" style.

-* [ ] match release fuzzy should work not just with title
-* [ ] match container name functions (maybe also with abbreviations, etc)
-* [ ] better documentation, more examples
-* [ ] shiv based packaging
+This spares us an extra step of generating the cluster documents, but requires
+an extra component, that allows to plug in various "reduce" functions. In
+refcat, this component is called "zipkey", which is support batching, too.
--
cgit v1.2.3