author | Martin Czygan <martin@archive.org> | 2021-11-16 19:06:26 +0000 |
---|---|---|
committer | Martin Czygan <martin@archive.org> | 2021-11-16 19:06:26 +0000 |
commit | 24dcddc4e4cff744e7c0a964856329d2ac69601d (patch) | |
tree | ad8650892805e55ec4a6958f9e1539eb675332b8 /notes | |
parent | 282f315c6ba3643c8c614220ab2f7e1d55de3658 (diff) | |
parent | 409392d66c3a6debe5bc69c0e2308209ac74ee35 (diff) | |
Merge branch 'martin-matcher-class' into 'master'
turn "match_release_fuzzy" into a class
See merge request webgroup/fuzzycat!10
Diffstat (limited to 'notes')
-rw-r--r-- | notes/2021_11_fuzzycat_refactoring.md | 87 |
1 files changed, 87 insertions, 0 deletions
diff --git a/notes/2021_11_fuzzycat_refactoring.md b/notes/2021_11_fuzzycat_refactoring.md
new file mode 100644
index 0000000..171cee3
--- /dev/null
+++ b/notes/2021_11_fuzzycat_refactoring.md
@@ -0,0 +1,87 @@
+# Proposal: Fuzzycat Refactoring
+
+* Goal: Refactor fuzzycat to make matching and verification more composable,
+  configurable and testable.
+* Status: wip
+
+A better design.
+
+* it has a correct scope (e.g. match X; verify Y)
+* it has good defaults, but allows configuration
+* it is clear how and where to extend functionality
+* it is easy to add one new test for a case
+
+## Matching
+
+* fuzzy matching will be a cascade of queries, until a result is returned
+* there is an order of queries from exact to very fuzzy
+* alternatively, we could use "ensemble matching", which takes the intersection of a couple of queries
+* ES queries cannot cover all cases, so we need to add additional checks; e.g. author list comparison
+
+Example
+
+    FuzzyReleaseMatcher
+        match_release_id
+        match_release_exact_title_exact_contrib
+        match_release_...
+
+        match_release_fuzzy (runs a cascade of queries)
+
+Each function is testable on its own. The class keeps the es client and other
+global config around. Its scope is clear: given a "release" (or maybe just a
+title string), generate a list of potentially related releases.
+
+Other entities follow the same pattern.
+
+    FuzzyContainerMatcher
+        match_container_id
+        match_container_issn
+        match_container_abbreviation
+        match_container_...
+
+        match_container_fuzzy (runs a cascade of queries)
+
+A helper object (not exactly the entity) for matching lists of authors. Allows
+matching by various means, e.g. exact, short names, partial lists, etc. Should
+account for case, order, etc.
+
+    FuzzyContribsMatcher
+        match_exact
+        match_short_names
+        match_partial_list
+
+        match_fuzzy
+
+For each method in each matcher class, we can construct a test case only for
+one particular method. A new method can be added with ease and tested separately.
+
+Don't know how yet, but we can create some "profiles" that allow matching
+by a set of methods. Or use good defaults on the higher level `_fuzzy(...)` method.
+
+NOTE: the matcher classes could use the verification code internally; generate
+a list of matches with an es query, then use a configured verifier to generate
+verified matches; only put comparison code into verification module.
+
+## Verification (comparison)
+
+Verification works similarly. For each entity we define a set of methods, each verifying a specific aspect.
+
+    FuzzyReleaseVerifier
+        verify_release_id
+        verify_release_ext_id
+        verify_release_title_exact_match
+        verify_release_title_contrib_exact_match
+        verify_release_...
+
+        verify(a, b) -> (Status, Reason)
+
+A large number of test cases are already there, may need a bit better structure
+to relate cases to methods. The class can hold global configuration, maybe some
+cached computed properties, if that helps.
+
+
+    FuzzyContainerVerifier
+        verify_container_id
+        ...
+
+
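
The cascade described under "Matching" and the `verify(a, b) -> (Status, Reason)` shape under "Verification" can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from this merge: the `Release` dataclass, the in-memory list standing in for the es client, the `match_release_normalized_title` step, the `Status` members, and the use of plain strings for the reason are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Release:
    # Hypothetical, minimal stand-in for a release entity.
    ident: str
    title: str
    contribs: Tuple[str, ...] = ()


class FuzzyReleaseMatcher:
    """Cascade matcher sketch; a plain list stands in for the es client."""

    def __init__(self, index: List[Release]):
        self.index = index

    def match_release_id(self, release: Release) -> List[Release]:
        if not release.ident:
            return []
        return [r for r in self.index if r.ident == release.ident]

    def match_release_exact_title_exact_contrib(self, release: Release) -> List[Release]:
        return [r for r in self.index
                if r.title == release.title and r.contribs == release.contribs]

    def match_release_normalized_title(self, release: Release) -> List[Release]:
        # Hypothetical fuzzier step: ignore case and surrounding whitespace.
        key = release.title.casefold().strip()
        return [r for r in self.index if r.title.casefold().strip() == key]

    def match_release_fuzzy(self, release: Release) -> List[Release]:
        # Run queries from exact to fuzzy; stop at the first non-empty result.
        cascade = [
            self.match_release_id,
            self.match_release_exact_title_exact_contrib,
            self.match_release_normalized_title,
        ]
        for method in cascade:
            results = method(release)
            if results:
                return results
        return []


class Status(Enum):
    # Hypothetical status values for this sketch.
    EXACT = "exact"
    AMBIGUOUS = "ambiguous"


Verdict = Tuple[Status, str]


class FuzzyReleaseVerifier:
    """Each verify_* method checks one aspect and returns None if undecided."""

    def verify_release_id(self, a: Release, b: Release) -> Optional[Verdict]:
        if a.ident and a.ident == b.ident:
            return (Status.EXACT, "same release id")
        return None

    def verify_release_title_contrib_exact_match(self, a: Release, b: Release) -> Optional[Verdict]:
        if a.title == b.title and a.contribs == b.contribs:
            return (Status.EXACT, "title and contribs match")
        return None

    def verify(self, a: Release, b: Release) -> Verdict:
        # The first check that reaches a decision wins.
        for check in (self.verify_release_id,
                      self.verify_release_title_contrib_exact_match):
            verdict = check(a, b)
            if verdict is not None:
                return verdict
        return (Status.AMBIGUOUS, "no check reached a decision")
```

Because every `match_*` and `verify_*` method is an ordinary method taking plain values, each one can be exercised by a test that touches nothing else, and a "profile" could be as simple as a different list of methods handed to the cascade loop.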