# Proposal: Fuzzycat Refactoring

* Goal: Refactor fuzzycat to make matching and verification more composable,
  configurable and testable.
* Status: wip

Properties of a better design:

* it has a correct scope (e.g. match X; verify Y)
* it has good defaults, but allows configuration
* it is clear how and where to extend functionality
* it is easy to add one new test for a case

## Matching

* fuzzy matching will be a cascade of queries, until a result is returned
* there is an order of queries from exact to very fuzzy
* alternatively, we could use "ensemble matching", which takes the intersection of the results of a couple of queries
* ES queries cannot cover all cases; we need additional checks, e.g. author list comparison

Example

    FuzzyReleaseMatcher
        match_release_id
        match_release_exact_title_exact_contrib
        match_release_...

        match_release_fuzzy (runs a cascade of queries)

Each function is testable on its own. The class keeps the ES client and other
global config around. Its scope is clear: given a "release" (or maybe just a
title string), generate a list of potentially related releases.
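
A minimal sketch of the cascade, under assumptions: `search` is a hypothetical stand-in for the ES client (any callable that takes a query dict and returns a list of release dicts), and the field names `ident`, `title` and `contribs` are chosen for illustration only.

```python
from typing import Callable, Dict, List

# Hypothetical search interface standing in for the ES client.
Search = Callable[[Dict], List[Dict]]


class FuzzyReleaseMatcher:
    """Sketch: try queries from exact to fuzzy, stop at the first hit."""

    def __init__(self, search: Search):
        self.search = search

    def match_release_id(self, release: Dict) -> List[Dict]:
        ident = release.get("ident")
        if not ident:
            return []
        return self.search({"ident": ident})

    def match_release_exact_title_exact_contrib(self, release: Dict) -> List[Dict]:
        title = release.get("title")
        contribs = release.get("contribs")
        if not title or not contribs:
            return []
        return self.search({"title": title, "contribs": contribs})

    def match_release_fuzzy(self, release: Dict) -> List[Dict]:
        # The order of the cascade encodes "exact first, fuzzy later".
        cascade = [
            self.match_release_id,
            self.match_release_exact_title_exact_contrib,
        ]
        for step in cascade:
            results = step(release)
            if results:
                return results
        return []
```

Because each step is a plain method, a test can pass in an in-memory `search` function and exercise a single step without a running ES instance.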

Other entities follow the same pattern.

    FuzzyContainerMatcher
        match_container_id
        match_container_issn
        match_container_abbreviation
        match_container_...

        match_container_fuzzy (runs a cascade of queries)

A helper object (not exactly an entity) for matching lists of authors. It allows
matching by various means, e.g. exact, short names, partial lists, and should
account for case, order, etc.

    FuzzyContribsMatcher
        match_exact
        match_short_names
        match_partial_list

        match_fuzzy
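
A sketch of how such a helper could look, assuming author lists are plain lists of name strings. The short form used here (first initial plus last token) is an assumption made for illustration, not a decided rule.

```python
from typing import List


class FuzzyContribsMatcher:
    """Sketch: compare two author name lists, from strict to lenient."""

    @staticmethod
    def _normalize(name: str) -> str:
        # Lowercase and collapse whitespace so that case does not matter.
        return " ".join(name.lower().split())

    @staticmethod
    def _short(name: str) -> str:
        # Assumed short form: first initial plus last token,
        # e.g. "Jane Q. Doe" and "J. Doe" both become "j doe".
        parts = name.lower().replace(".", " ").split()
        if len(parts) < 2:
            return " ".join(parts)
        return f"{parts[0][0]} {parts[-1]}"

    def match_exact(self, a: List[str], b: List[str]) -> bool:
        # Sorting makes the comparison independent of author order.
        return sorted(map(self._normalize, a)) == sorted(map(self._normalize, b))

    def match_short_names(self, a: List[str], b: List[str]) -> bool:
        return sorted(map(self._short, a)) == sorted(map(self._short, b))

    def match_partial_list(self, a: List[str], b: List[str]) -> bool:
        # True if the shorter (non-empty) list is contained in the longer one.
        sa, sb = set(map(self._short, a)), set(map(self._short, b))
        small, large = sorted((sa, sb), key=len)
        return bool(small) and small <= large

    def match_fuzzy(self, a: List[str], b: List[str]) -> bool:
        checks = (self.match_exact, self.match_short_names, self.match_partial_list)
        return any(check(a, b) for check in checks)
```

For example, `match_short_names(["Jane Q. Doe"], ["J. Doe"])` holds under the assumed short form, while `match_exact` on the same input does not.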

For each method in each matcher class, we can construct a test case that
exercises only that method. A new method can be added with ease and tested separately.

Not sure how yet, but we could create some "profiles" that allow matching
by a chosen set of methods, or use good defaults in the higher-level `_fuzzy(...)` method.

NOTE: the matcher classes could use the verification code internally: generate
a list of candidates with an ES query, then use a configured verifier to keep
only verified matches. Comparison code would then live only in the verification module.
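
One way this composition could look, as a sketch only. The class name `VerifyingReleaseMatcher`, the two-member `Status` enum and the `search`/`verify` callables are all hypothetical and stand in for the real ES query and verifier.

```python
from enum import Enum
from typing import Callable, Dict, List, Sequence, Tuple


class Status(Enum):
    # Hypothetical status values, for illustration only.
    STRONG = "strong"
    DIFFERENT = "different"


Search = Callable[[Dict], List[Dict]]
Verify = Callable[[Dict, Dict], Tuple[Status, str]]


class VerifyingReleaseMatcher:
    """Sketch: candidate generation plus a pluggable verifier."""

    def __init__(self, search: Search, verify: Verify,
                 accept: Sequence[Status] = (Status.STRONG,)):
        self.search = search    # produces candidates, e.g. via an ES query
        self.verify = verify    # all comparison logic lives in the verifier
        self.accept = tuple(accept)

    def match_release_fuzzy(self, release: Dict) -> List[Dict]:
        candidates = self.search(release)
        # Keep only candidates whose verification status is accepted.
        return [c for c in candidates
                if self.verify(release, c)[0] in self.accept]
```

The matcher itself contains no comparison code here; swapping the `verify` callable or the `accept` set changes how strict the result list is.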

## Verification (comparison)

Verification works similarly. For each entity we define a set of methods, each verifying a specific aspect.

    FuzzyReleaseVerifier
        verify_release_id
        verify_release_ext_id
        verify_release_title_exact_match
        verify_release_title_contrib_exact_match
        verify_release_...

        verify(a, b) -> (Status, Reason)

A large number of test cases already exist; they may need a bit better structure
to relate cases to methods. The class can hold global configuration and maybe some
cached computed properties, if that helps.
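
A sketch of the `verify(a, b) -> (Status, Reason)` shape, under assumptions: the `Status` and `Reason` members below are illustrative, releases are plain dicts, and each `verify_release_*` method returns `None` when it cannot decide so that the next check runs.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Status(Enum):
    # Illustrative members, not a final list.
    EXACT = "exact"
    STRONG = "strong"
    AMBIGUOUS = "ambiguous"


class Reason(Enum):
    # Illustrative members, not a final list.
    RELEASE_ID = "release_id"
    TITLE_EXACT = "title_exact"
    UNKNOWN = "unknown"


Result = Optional[Tuple[Status, Reason]]


class FuzzyReleaseVerifier:
    """Sketch: each method checks one aspect; verify() combines them."""

    def verify_release_id(self, a: Dict, b: Dict) -> Result:
        if a.get("ident") and a.get("ident") == b.get("ident"):
            return Status.EXACT, Reason.RELEASE_ID
        return None

    def verify_release_title_exact_match(self, a: Dict, b: Dict) -> Result:
        if a.get("title") and a.get("title") == b.get("title"):
            return Status.STRONG, Reason.TITLE_EXACT
        return None

    def verify(self, a: Dict, b: Dict) -> Tuple[Status, Reason]:
        checks = (
            self.verify_release_id,
            self.verify_release_title_exact_match,
        )
        for check in checks:
            result = check(a, b)
            if result is not None:
                return result
        # No check was decisive.
        return Status.AMBIGUOUS, Reason.UNKNOWN
```

With this layout an existing test case can be related to a method by asserting on the returned `Reason`, which names the check that decided the outcome.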


    FuzzyContainerVerifier
        verify_container_id
        ...