author Martin Czygan <martin@archive.org> 2021-11-16 19:06:26 +0000
committer Martin Czygan <martin@archive.org> 2021-11-16 19:06:26 +0000
commit 24dcddc4e4cff744e7c0a964856329d2ac69601d
tree ad8650892805e55ec4a6958f9e1539eb675332b8 /notes
parent 282f315c6ba3643c8c614220ab2f7e1d55de3658
parent 409392d66c3a6debe5bc69c0e2308209ac74ee35
Merge branch 'martin-matcher-class' into 'master'
turn "match_release_fuzzy" into a class
See merge request webgroup/fuzzycat!10
Diffstat (limited to 'notes')
-rw-r--r-- notes/2021_11_fuzzycat_refactoring.md | 87
1 files changed, 87 insertions, 0 deletions
diff --git a/notes/2021_11_fuzzycat_refactoring.md b/notes/2021_11_fuzzycat_refactoring.md
new file mode 100644
index 0000000..171cee3
--- /dev/null
+++ b/notes/2021_11_fuzzycat_refactoring.md
@@ -0,0 +1,87 @@
+# Proposal: Fuzzycat Refactoring
+
+* Goal: Refactor fuzzycat to make matching and verification more composable,
+ configurable and testable.
+* Status: wip
+
+A better design has the following properties:
+
+* it has a correct scope (e.g. match X; verify Y)
+* it has good defaults, but allows configuration
+* it is clear how and where to extend functionality
+* it is easy to add one new test for a case
+
+## Matching
+
+* fuzzy matching will be a cascade of queries, until a result is returned
+* there is an order of queries from exact to very fuzzy
+* alternatively, we could use "ensemble matching", which takes the intersection of several queries
+* ES queries cannot cover all cases; we need additional checks, e.g. author list comparison
+
+Example
+
+    FuzzyReleaseMatcher
+        match_release_id
+        match_release_exact_title_exact_contrib
+        match_release_...
+
+        match_release_fuzzy (runs a cascade of queries)
+
+Each function is testable on its own. The class keeps the es client and other
+global config around. Its scope is clear: given a "release" (or maybe just a
+title string), generate a list of potentially related releases.
+
+Other entities follow the same pattern.
+
+    FuzzyContainerMatcher
+        match_container_id
+        match_container_issn
+        match_container_abbreviation
+        match_container_...
+
+        match_container_fuzzy (runs a cascade of queries)
+
+A helper object (not exactly an entity) matches lists of authors. It allows
+matching by various means, e.g. exact, short names, partial lists, etc., and
+should account for case, order, etc.
+
+    FuzzyContribsMatcher
+        match_exact
+        match_short_names
+        match_partial_list
+
+        match_fuzzy
+
+For each method in each matcher class, we can construct a test case that
+exercises only that method. A new method can be added with ease and tested separately.
+
+It is not yet clear how, but we could create "profiles" that select a set of
+methods to match with. Alternatively, use good defaults in the higher-level `_fuzzy(...)` method.
+
+NOTE: the matcher classes could use the verification code internally: generate
+a list of matches with an es query, then use a configured verifier to produce
+verified matches. Comparison code would then live only in the verification module.
+
+## Verification (comparison)
+
+Verification works similarly. For each entity we define a set of methods, each verifying a specific aspect.
+
+    FuzzyReleaseVerifier
+        verify_release_id
+        verify_release_ext_id
+        verify_release_title_exact_match
+        verify_release_title_contrib_exact_match
+        verify_release_...
+
+        verify(a, b) -> (Status, Reason)
+
+A large number of test cases already exist; they may need a better structure
+to relate cases to methods. The class can hold global configuration and maybe
+some cached computed properties, if that helps.
+
+
+    FuzzyContainerVerifier
+        verify_container_id
+        ...
+
+
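The cascade and the `verify(a, b) -> (Status, Reason)` shape proposed in the notes above can be illustrated with a minimal sketch. It is not the fuzzycat implementation. It makes several assumptions: releases are plain dicts with hypothetical `ident`, `title` and `contribs` keys, an in-memory list stands in for the es client, the `Status` values and reason strings are invented for illustration, and `match_release_title_tokens` is a hypothetical stand-in for one of the elided `match_release_...` methods.

```python
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple

Release = Dict[str, Any]  # assumed shape: {"ident": str, "title": str, "contribs": [str]}


class Status(Enum):  # illustrative values only
    EXACT = "exact"
    STRONG = "strong"
    AMBIGUOUS = "ambiguous"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def contrib_key(release: Release) -> List[str]:
    """Order- and case-insensitive view of the contributor list."""
    return sorted(normalize(name) for name in release.get("contribs", []))


def same_title_and_contribs(a: Release, b: Release) -> bool:
    title = normalize(a.get("title", ""))
    return bool(title) and title == normalize(b.get("title", "")) \
        and contrib_key(a) == contrib_key(b)


class FuzzyReleaseVerifier:
    """Each verify_* method checks one aspect and returns None if undecided."""

    def verify_release_id(self, a: Release, b: Release) -> Optional[Tuple[Status, str]]:
        if a.get("ident") and a.get("ident") == b.get("ident"):
            return Status.EXACT, "same_ident"
        return None

    def verify_release_title_contrib_exact_match(self, a, b) -> Optional[Tuple[Status, str]]:
        if same_title_and_contribs(a, b):
            return Status.STRONG, "title_contrib_exact"
        return None

    def verify(self, a: Release, b: Release) -> Tuple[Status, str]:
        for check in (self.verify_release_id,
                      self.verify_release_title_contrib_exact_match):
            result = check(a, b)
            if result is not None:
                return result
        return Status.AMBIGUOUS, "no_check_matched"


class FuzzyReleaseMatcher:
    """Keeps the index (stand-in for the es client) and runs queries in order."""

    def __init__(self, index: List[Release]):
        self.index = index

    def match_release_id(self, release: Release) -> List[Release]:
        ident = release.get("ident")
        if not ident:
            return []
        return [r for r in self.index if r.get("ident") == ident]

    def match_release_exact_title_exact_contrib(self, release: Release) -> List[Release]:
        return [r for r in self.index if same_title_and_contribs(release, r)]

    def match_release_title_tokens(self, release: Release, threshold: float = 0.5) -> List[Release]:
        """Hypothetical fuzzy step: Jaccard overlap of title tokens."""
        tokens = set(normalize(release.get("title", "")).split())
        hits = []
        for r in self.index:
            other = set(normalize(r.get("title", "")).split())
            if tokens and other and len(tokens & other) / len(tokens | other) >= threshold:
                hits.append(r)
        return hits

    def match_release_fuzzy(self, release: Release) -> List[Release]:
        """Cascade from exact to fuzzy; the first non-empty result wins."""
        for query in (self.match_release_id,
                      self.match_release_exact_title_exact_contrib,
                      self.match_release_title_tokens):
            candidates = query(release)
            if candidates:
                return candidates
        return []


INDEX = [
    {"ident": "r1", "title": "Fuzzy Matching of Releases",
     "contribs": ["A. Author", "B. Writer"]},
    {"ident": "r2", "title": "An Unrelated Paper", "contribs": ["C. Other"]},
]

if __name__ == "__main__":
    matcher, verifier = FuzzyReleaseMatcher(INDEX), FuzzyReleaseVerifier()
    query = {"title": "fuzzy  matching of releases", "contribs": ["B. Writer", "A. Author"]}
    for candidate in matcher.match_release_fuzzy(query):
        print(candidate["ident"], verifier.verify(query, candidate))
```

Because every `match_*` and `verify_*` method is an ordinary method on a small class, a test can target one method in isolation. Running the verifier over the matcher's candidates, as in the final loop, follows the NOTE about keeping comparison code in the verification module.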