# TODO

* [ ] match release with fewer requests (or do them in parallel)
* [ ] de-clobber verify

----

* [ ] clustering should be broken up, e.g. into "map" and "sort"
* [x] match release should be a class
* [x] match release fuzzy should work not just with title
* [ ] match container name functions (maybe also with abbreviations, etc)
* [ ] better documentation, more examples
* [ ] shiv-based packaging
* [ ] author similarity should be broken up; easier to tweak
* [ ] split up `verify`
* [ ] configurable `verify`

Other repos:

* [refcat/skate](https://gitlab.com/internetarchive/refcat/-/tree/master/skate)

In refcat, we have one simple operation: extract a list of fields from a blob
of bytes. We currently use [16 mappers](https://is.gd/E0NEXj); they are easy
to write.
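
As an illustration of the "extract a list of fields from a blob of bytes"
idea, here is a minimal Python sketch. It is not taken from refcat (whose
mappers are written in Go); the names `map_title` and `to_tsv`, the JSON-lines
input and the title normalization are assumptions made for this example.

```python
import json
from typing import List


def map_title(blob: bytes) -> List[bytes]:
    """Hypothetical mapper: turn one JSON line into [key, original document].

    The key is the lowercased title with whitespace collapsed; this
    normalization is an assumption, not the rule any existing mapper uses.
    """
    doc = json.loads(blob)
    title = doc.get("title") or ""
    key = " ".join(title.lower().split())
    return [key.encode("utf-8"), blob.strip()]


def to_tsv(fields: List[bytes]) -> bytes:
    """Join the extracted fields into one tab-separated line, ready for sorting."""
    return b"\t".join(fields) + b"\n"
```

A mapper of this shape has no state and no knowledge of sorting or grouping,
which is what keeps it small.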

In refcat, we use GNU sort, and only where we need it, e.g. in
[skate-map](https://is.gd/Kt9hvL).

The `Cluster` class bundles iteration, key extraction, sorting and the group-by
operation into a single entity.
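
One way the bundled steps could be pulled apart is sketched below. This is
not the current `Cluster` code; the function names `map_stage`, `sort_stage`
and `group_stage` are hypothetical, and the in-memory `sorted` call merely
stands in for an external sort.

```python
import itertools
import operator
from typing import Any, Callable, Iterable, Iterator, List, Tuple

Pair = Tuple[str, Any]


def map_stage(docs: Iterable[Any], key_func: Callable[[Any], str]) -> Iterator[Pair]:
    """Key extraction only: emit (key, doc) pairs, no sorting or grouping."""
    for doc in docs:
        yield key_func(doc), doc


def sort_stage(pairs: Iterable[Pair]) -> List[Pair]:
    """Sort by key; a stand-in for an external sort on larger inputs."""
    return sorted(pairs, key=operator.itemgetter(0))


def group_stage(sorted_pairs: Iterable[Pair]) -> Iterator[Tuple[str, List[Any]]]:
    """Group adjacent pairs that share a key; the input must already be sorted."""
    for key, group in itertools.groupby(sorted_pairs, key=operator.itemgetter(0)):
        yield key, [doc for _, doc in group]
```

Usage would chain the stages, e.g.
`group_stage(sort_stage(map_stage(docs, key_func)))`, so that each stage can be
replaced or tested on its own.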

Also in refcat, we do not work on a single file with clusters any more, but
mostly with two sorted streams, which are iterated over "mergesort/comm" style.

This spares us an extra step of generating the cluster documents, but requires
an extra component that allows plugging in various "reduce" functions. In
refcat, this component is called "zipkey", which supports batching, too.
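
The "mergesort/comm" style iteration over two sorted streams with a pluggable
reduce function can be sketched in Python as follows. This is an illustration
of the idea only, not refcat's Go implementation: the signature of `zipkey`
and the helper `_groups` are assumptions, both inputs are assumed to be sorted
by the same key, and batching is left out.

```python
import itertools
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def _groups(stream: Iterable[Any], key: Callable[[Any], Any]) -> Iterator[Tuple[Any, List[Any]]]:
    """Yield (key, items) for each run of adjacent items sharing a key."""
    for k, group in itertools.groupby(stream, key=key):
        # Materialize the run, since groupby invalidates it on the next step.
        yield k, list(group)


def zipkey(
    left: Iterable[Any],
    right: Iterable[Any],
    key: Callable[[Any], Any],
    reduce_func: Callable[[List[Any], List[Any]], Any],
) -> Iterator[Any]:
    """Walk two key-sorted streams in lockstep and call reduce_func on
    every pair of groups that share a key."""
    lgroups, rgroups = _groups(left, key), _groups(right, key)
    lcur, rcur = next(lgroups, None), next(rgroups, None)
    while lcur is not None and rcur is not None:
        if lcur[0] < rcur[0]:
            lcur = next(lgroups, None)  # key only on the left, skip it
        elif lcur[0] > rcur[0]:
            rcur = next(rgroups, None)  # key only on the right, skip it
        else:
            yield reduce_func(lcur[1], rcur[1])
            lcur, rcur = next(lgroups, None), next(rgroups, None)
```

Because only the current group of each stream is held in memory, no
intermediate cluster document has to be written; a different matching strategy
is a different `reduce_func`.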