# Fuzzy matching review and retrospective

> 2021-09-15

After [refcat](https://gitlab.com/internetarchive/refcat) has reached a milestone, I'd like to review fuzzycat and fuzzy matching in general; this should help pave the way to a slight redesign of the overall approach.

## TL;DR

* performance matters at scale and a faster language (e.g. Go) is essential
* for small scale, the API matters more than performance
* a lot of the code currently is based on specific schemas (e.g. release, a specific elasticsearch mapping, etc.), so not that much code is generic or reusable - it also seems overkill to try to abstract the schema away

## Ideas

* [ ] use pydantic or dataclass to make the schema more explicit
* [ ] extend type annotation coverage
* [ ] remove bulk stuff, remove clustering etc.; improve the verification part
* [ ] use cases: work merging

A few more things to revisit:

* [ ] revisit journal matching; what is weak, strong?
* [ ] refactor: author list or string comparison
* [ ] go beyond title matching when querying elasticsearch
* [ ] better container name matching

Take a look at:

> https://github.com/djudd/human-name

----

## Redesign ideas

### Large scale processing

JSON decoding and encoding does not seem to be the bottleneck, but working with various, often optional fields gets expensive in Python (whereas in Go, we can use a struct).

The mapping stage in [refcat/skate](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/map.go#L109-111) is a simple operation (blob to fields) that can be implemented in isolation and then [added to the command line](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/cmd/skate-map/main.go#L67-87). In skate, we already have over a dozen mappers working on various types. There's even a bit of [map middleware](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/map.go#L152-161).

In fuzzycat, the [Cluster](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L309-347) class does mapping via [key](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L331), [sorting](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L406-426), and a specific [grouping](https://git.archive.org/webgroup/fuzzycat/-/blob/c587a084defe54103aa147b7ab91542a11a548b1/fuzzycat/cluster.py#L428-454), all in one go.

For example, in refcat/skate we no longer use the single cluster document; there, we keep two separate files and use an extra [zipkey](https://gitlab.com/internetarchive/refcat/-/blob/3a79551dfe54ba668f7eee9de88625a0d33d9c7f/skate/zipkey/zipkey.go#L23-33) type, which is a slightly generalized [comm](https://en.wikipedia.org/wiki/Comm): it allows us to run a function over a cluster of documents (currently coming from two streams).

A higher level command could encapsulate the whole pipeline, without needing an extra framework like luigi:

```
inputs    A    B
          |    |
mapped    M1   M2
          |    |
sorted    S1   S2
           \   /
            \ /
reduced      V
             |
             |
             C
```

> Not sure if we need mappers here at all, if we already have them in refcat.

A command could look like this:

```shell
$ fuzzycat pipeline -a A.json -b B.json --mapper-a "tn" --mapper-b "tn" --reduce "bref"
```

It would be nice if this actually ran fast.
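To make the shape above a bit more concrete, here is a minimal, in-memory Python sketch of the map-sort-reduce idea. None of the names used (`title_key`, `verify`, `keyed`, `zip_by_key`, `run`) are existing fuzzycat or skate APIs, and a real implementation would sort to temporary files and stream over them instead of holding everything in memory.

```python
# Hypothetical sketch only; names do not correspond to fuzzycat/skate APIs.
import itertools
import json
import operator


def title_key(doc: dict) -> str:
    """Toy mapper, roughly a "normalized title" key."""
    title = doc.get("title") or ""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def verify(a: dict, b: dict) -> dict:
    """Toy reducer: emit a match candidate for a pair of documents."""
    return {"a": a.get("ident"), "b": b.get("ident"), "status": "candidate"}


def keyed(path: str, mapper) -> list:
    """Map a file of JSON lines to (key, doc) pairs, sorted by key."""
    with open(path) as f:
        docs = (json.loads(line) for line in f)
        return sorted(((mapper(d), d) for d in docs), key=operator.itemgetter(0))


def zip_by_key(xs, ys):
    """Pair up groups sharing a key from two key-sorted (key, doc) lists."""
    gx = {k: [d for _, d in g] for k, g in itertools.groupby(xs, operator.itemgetter(0))}
    gy = {k: [d for _, d in g] for k, g in itertools.groupby(ys, operator.itemgetter(0))}
    for k in sorted(gx.keys() & gy.keys()):
        if not k:  # skip documents without a usable key
            continue
        yield k, gx[k], gy[k]


def run(path_a: str, path_b: str, mapper=title_key, reducer=verify):
    """The whole pipeline: map both inputs, sort, group by key, reduce pairs."""
    for _, docs_a, docs_b in zip_by_key(keyed(path_a, mapper), keyed(path_b, mapper)):
        for a, b in itertools.product(docs_a, docs_b):
            yield reducer(a, b)
```

The `zip_by_key` step plays the comm-like role that zipkey plays in skate.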
The pipeline could also be run programmatically:

```python
output = fuzzycat_pipeline(a="A.json", b="B.json", mapper_a="tn", mapper_b="tn", reduce="bref")
```

Mappers should have a minimal scope; each mapper will have a format it can work on. Reducers will have two input types specified.

### Running continuously

> With a couple of the inputs (metadata, extracted data, ...) getting updated
> all the time, it might for the moment be simpler to rerun the derivation in
> batch mode.

A list of steps we would need to implement for continuous reference index updates:

* a new metadata document arrives (e.g. via "changelog")
* if the metadata contains outbound references, nice; if not, we try to download the associated PDF, run GROBID and get the references out that way

At this point, we have the easier part - outbound references - covered.

Where do the outbound references of all existing docs live? In the database only, hence we cannot search for them currently.

* [7ppmkfo5krb2zhefhwkwdp4mqe](https://search.fatcat.wiki/fatcat_release/_search?q=ident:7ppmkfo5krb2zhefhwkwdp4mqe) says `ref_count` 12, but the list of refs can only be fetched via the [API](https://api.fatcat.wiki/v0/release/7ppmkfo5krb2zhefhwkwdp4mqe)

We could add another elasticsearch index just for the raw refs. For example: every time an item is updated, this index gets updated as well (taking refs from the API and putting them into ES). We can then query for any ID we find in a reference, or for any string match, etc.

Once we find, e.g., ten documents that have the document in question in their reference list, we can update the reference index for each of these documents.

We could keep a (weekly) refs snapshot file around that would be used for matching. The result would be, e.g., the ten documents that refer to the document in question. We can take their ids and update the document to establish the link.

The on-disk file (or files) should be all prepared, e.g. sorted by key, so the lookup will be fast.
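Returning to the raw refs index idea above, the inbound lookup for a newly updated document could look roughly like the following sketch. The index name `fatcat_ref_raw` and the field names (`biblio.doi`, `biblio.title`, `release_ident`) are assumptions made for illustration, not an existing fatcat/refcat schema.

```python
# Sketch only: index name and field names below are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint


def find_referring_releases(doi: str, title: str, size: int = 50) -> set:
    """Idents of releases whose raw references mention the given DOI or title."""
    query = {
        "bool": {
            "should": [
                {"term": {"biblio.doi": doi}},       # exact identifier match
                {"match": {"biblio.title": title}},  # fuzzy string match fallback
            ],
            "minimum_should_match": 1,
        }
    }
    resp = es.search(index="fatcat_ref_raw", query=query, size=size)
    return {hit["_source"]["release_ident"] for hit in resp["hits"]["hits"]}


# Each ident returned here is a document whose entry in the reference index
# would then be updated to include the link to the newly arrived document.
```

The same lookup could instead be answered from the sorted snapshot file, e.g. via a binary search over the key column.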