# Proposal: Fuzzycat Refactoring
* Goal: Refactor fuzzycat to make matching and verification more composable,
configurable and testable.
* Status: wip

A better design has the following properties:

* it has a clear scope (e.g. match X; verify Y)
* it has good defaults, but allows configuration
* it is clear how and where to extend functionality
* it is easy to add a new test for a single case

## Matching
* fuzzy matching will be a cascade of queries, until a result is returned
* there is an order of queries from exact to very fuzzy
* alternatively, we could use "ensemble matching", which takes the intersection of the results of several queries
* ES queries cannot cover all cases, so we need additional checks, e.g. author list comparison
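
The two strategies listed above can be sketched as plain functions. This is only an illustration under assumptions: the `Query` type, the in-memory `INDEX` and the three stand-in queries are hypothetical and replace real ES requests.

```python
from typing import Callable, Iterable, List, Set

# Assumption: a query takes a title and returns candidate release identifiers.
# In practice this would wrap an ES request; here it is a plain callable.
Query = Callable[[str], List[str]]


def cascade(queries: Iterable[Query], title: str) -> List[str]:
    """Run queries in order (exact first) and return the first non-empty result."""
    for query in queries:
        result = query(title)
        if result:
            return result
    return []


def ensemble(queries: Iterable[Query], title: str) -> Set[str]:
    """Return only the identifiers that every query returns."""
    result_sets = [set(query(title)) for query in queries]
    if not result_sets:
        return set()
    return set.intersection(*result_sets)


# Hypothetical in-memory index standing in for the search backend.
INDEX = {
    "r1": "Deep Learning",
    "r2": "Deep learning methods",
    "r3": "Shallow Learning",
}


def exact_title(title: str) -> List[str]:
    return [k for k, v in INDEX.items() if v == title]


def lowercase_title(title: str) -> List[str]:
    return [k for k, v in INDEX.items() if v.lower() == title.lower()]


def token_overlap(title: str) -> List[str]:
    tokens = set(title.lower().split())
    return [k for k, v in INDEX.items() if tokens & set(v.lower().split())]
```

With a cascade, the order of the queries decides precision: the first query that returns anything wins. The ensemble variant is stricter, since a candidate has to be returned by every query.
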

Example:

    FuzzyReleaseMatcher
        match_release_id
        match_release_exact_title_exact_contrib
        match_release_...
        match_release_fuzzy (runs a cascade of queries)

Each function is testable on its own. The class keeps the es client and other
global config around. Its scope is clear: given a "release" (or maybe just a
title string), generate a list of potentially related releases.
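
A minimal sketch of such a class, assuming a simplified `Release` dataclass and a hypothetical `SearchBackend` interface in place of the real es client. `match_release_fuzzy_title` and `max_results` are illustrative names, not part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class Release:
    """Simplified stand-in for a release entity (assumption)."""
    ident: Optional[str] = None
    title: Optional[str] = None
    contribs: List[str] = field(default_factory=list)


class SearchBackend(Protocol):
    """Hypothetical interface; a real implementation would wrap the es client."""

    def find(self, attr: str, value: str, fuzzy: bool = False) -> List[Release]: ...


class InMemoryBackend:
    """Toy backend, only here to make the sketch runnable."""

    def __init__(self, releases: List[Release]):
        self.releases = releases

    def find(self, attr: str, value: str, fuzzy: bool = False) -> List[Release]:
        if fuzzy:
            needle = value.lower()
            return [r for r in self.releases
                    if needle in (getattr(r, attr) or "").lower()]
        return [r for r in self.releases if getattr(r, attr) == value]


class FuzzyReleaseMatcher:
    def __init__(self, backend: SearchBackend, max_results: int = 5):
        # The class keeps the backend and global config around.
        self.backend = backend
        self.max_results = max_results

    def match_release_id(self, release: Release) -> List[Release]:
        if not release.ident:
            return []
        return self.backend.find("ident", release.ident)

    def match_release_exact_title_exact_contrib(self, release: Release) -> List[Release]:
        if not release.title:
            return []
        candidates = self.backend.find("title", release.title)
        # The contrib comparison is an additional check on top of the query.
        return [c for c in candidates if c.contribs == release.contribs]

    def match_release_fuzzy_title(self, release: Release) -> List[Release]:
        if not release.title:
            return []
        return self.backend.find("title", release.title, fuzzy=True)

    def match_release_fuzzy(self, release: Release) -> List[Release]:
        """Cascade from exact to fuzzy; stop at the first non-empty result."""
        for method in (self.match_release_id,
                       self.match_release_exact_title_exact_contrib,
                       self.match_release_fuzzy_title):
            result = method(release)
            if result:
                return result[: self.max_results]
        return []
```

Because the backend is injected, each `match_release_*` method can be exercised in a test with a small in-memory list instead of a live index.
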
Other entities follow the same pattern.

    FuzzyContainerMatcher
        match_container_id
        match_container_issn
        match_container_abbreviation
        match_container_...
        match_container_fuzzy (runs a cascade of queries)

A helper object (not an entity itself) for matching lists of authors. It allows
matching by various means, e.g. exact, short names, partial lists, etc., and
should account for case, order, etc.

    FuzzyContribsMatcher
        match_exact
        match_short_names
        match_partial_list
        match_fuzzy

For each method in each matcher class, we can construct a test case that targets
only that method. A new method can be added easily and tested separately.
It is not yet clear how, but we could create "profiles", each of which enables
matching by a given set of methods. Alternatively, we could use good defaults in
the higher-level `_fuzzy(...)` method.
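
One possible shape for profiles, purely as an illustration: a profile is a named, ordered list of method names that the `_fuzzy` method runs as a cascade. The profile names, the `ProfiledMatcher` class and its toy `match_*` methods are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


class ProfiledMatcher:
    """Toy matcher; the match_* methods stand in for real query methods."""

    # Hypothetical profiles: name -> ordered method names, exact to fuzzy.
    PROFILES: Dict[str, Sequence[str]] = {
        "strict": ("match_id",),
        "default": ("match_id", "match_exact_title"),
        "relaxed": ("match_id", "match_exact_title", "match_fuzzy_title"),
    }

    def __init__(self, titles: Dict[str, str], profile: str = "default"):
        if profile not in self.PROFILES:
            raise ValueError(f"unknown profile: {profile}")
        self.titles = titles  # ident -> title, stands in for the index
        self.profile = profile

    def match_id(self, query: str) -> List[str]:
        return [query] if query in self.titles else []

    def match_exact_title(self, query: str) -> List[str]:
        return [k for k, v in self.titles.items() if v == query]

    def match_fuzzy_title(self, query: str) -> List[str]:
        needle = query.lower()
        return [k for k, v in self.titles.items() if needle in v.lower()]

    def match_fuzzy(self, query: str, profile: Optional[str] = None) -> List[str]:
        """Run the methods of the chosen profile in order; first hit wins."""
        for name in self.PROFILES[profile or self.profile]:
            result = getattr(self, name)(query)
            if result:
                return result
        return []
```

The instance default covers the "good defaults" case, while the optional `profile` argument allows a per-call override.
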
NOTE: the matcher classes could use the verification code internally: generate
a list of candidate matches with an ES query, then use a configured verifier to
produce verified matches. Comparison code should live only in the verification
module.
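
The split described in this note could look roughly like the following. It is a sketch under assumptions: `search_by_first_word` stands in for a deliberately broad query, and the verifier is reduced to a boolean callable for brevity.

```python
from typing import Callable, List

# Hypothetical corpus standing in for the index.
CORPUS = ["Deep Learning", "Deep Learning (2nd ed.)", "Deep Sea Fishing"]


def search_by_first_word(title: str) -> List[str]:
    """Stand-in for a broad query: cheap, high recall, low precision."""
    parts = title.lower().split()
    if not parts:
        return []
    return [t for t in CORPUS if t.lower().startswith(parts[0])]


def verify_title_case_insensitive(a: str, b: str) -> bool:
    """Comparison logic lives on the verification side only."""
    return a.strip().lower() == b.strip().lower()


def match_verified(
    title: str,
    search: Callable[[str], List[str]],
    verify: Callable[[str, str], bool],
) -> List[str]:
    """Generate candidates with the query, keep those the verifier accepts."""
    return [candidate for candidate in search(title) if verify(title, candidate)]
```

Since the matcher only receives the verifier as an argument, swapping in a stricter or looser comparison does not require touching the query code.
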
## Verification (comparison)
Verification works similarly. For each entity we define a set of methods, each verifying a specific aspect.

    FuzzyReleaseVerifier
        verify_release_id
        verify_release_ext_id
        verify_release_title_exact_match
        verify_release_title_contrib_exact_match
        verify_release_...
        verify(a, b) -> (Status, Reason)

A large number of test cases already exist, but they may need a better structure
to relate cases to methods. The class can hold global configuration and perhaps
some cached computed properties, if that helps.
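
A minimal sketch of a verifier along these lines. The members of `Status` and `Reason`, the simplified `Release` dataclass and the convention that a single check returns `None` when it cannot decide are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Status(Enum):
    EXACT = "exact"
    STRONG = "strong"
    DIFFERENT = "different"
    AMBIGUOUS = "ambiguous"


class Reason(Enum):
    SAME_ID = "same_id"
    SHARED_EXT_ID = "shared_ext_id"
    EXT_ID_MISMATCH = "ext_id_mismatch"
    TITLE_CONTRIB_MATCH = "title_contrib_match"
    UNKNOWN = "unknown"


@dataclass
class Release:
    """Simplified stand-in for a release entity (assumption)."""
    ident: Optional[str] = None
    title: Optional[str] = None
    ext_ids: Dict[str, str] = field(default_factory=dict)
    contribs: List[str] = field(default_factory=list)


# A single check either decides, or returns None to pass on to the next check.
Result = Optional[Tuple[Status, Reason]]


class FuzzyReleaseVerifier:
    def verify_release_id(self, a: Release, b: Release) -> Result:
        if a.ident and a.ident == b.ident:
            return Status.EXACT, Reason.SAME_ID
        return None

    def verify_release_ext_id(self, a: Release, b: Release) -> Result:
        shared = set(a.ext_ids) & set(b.ext_ids)
        for key in sorted(shared):
            if a.ext_ids[key] != b.ext_ids[key]:
                return Status.DIFFERENT, Reason.EXT_ID_MISMATCH
        if shared:
            return Status.EXACT, Reason.SHARED_EXT_ID
        return None

    def verify_release_title_contrib_exact_match(self, a: Release, b: Release) -> Result:
        if a.title and a.title == b.title and a.contribs and a.contribs == b.contribs:
            return Status.STRONG, Reason.TITLE_CONTRIB_MATCH
        return None

    def verify(self, a: Release, b: Release) -> Tuple[Status, Reason]:
        """Run the checks in order; the first one that decides wins."""
        for check in (self.verify_release_id,
                      self.verify_release_ext_id,
                      self.verify_release_title_contrib_exact_match):
            result = check(a, b)
            if result is not None:
                return result
        return Status.AMBIGUOUS, Reason.UNKNOWN
```

Each `verify_release_*` method maps naturally to a group of test cases, which may help with relating existing cases to methods.
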

    FuzzyContainerVerifier
        verify_container_id
        ...