author | Martin Czygan <martin.czygan@gmail.com> | 2021-07-08 17:48:39 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-07-08 17:48:39 +0200 |
commit | 6a97a067aa967d681c112f7d2ea1e02e038189ee (patch) | |
tree | 8550273ea9dd0154e2f2b22f1d3f4f3e6f0f752f | |
parent | 1d84fec0927a98e576a6525911252d334f0da48a (diff) | |
download | fuzzycat-6a97a067aa967d681c112f7d2ea1e02e038189ee.tar.gz fuzzycat-6a97a067aa967d681c112f7d2ea1e02e038189ee.zip |
notes on matching metrics
-rw-r--r-- | notes/matching_metrics.md | 16 |
1 files changed, 16 insertions, 0 deletions
diff --git a/notes/matching_metrics.md b/notes/matching_metrics.md
new file mode 100644
index 0000000..d37240f
--- /dev/null
+++ b/notes/matching_metrics.md
@@ -0,0 +1,16 @@
+# Matching Metrics
+
+## Precision/Recall
+
+For fuzzy matching we want to understand precision and recall. Options for test datasets:
+
+* manually curated (100s of examples); could determine
+* autogenerate slightly different set of real-world metadata (e.g. crossref vs. doaj) converted to releases
+* automatically distorted set of records; 1 original, plus N distorted (synthetic)
+
+## Overall numbers
+
+* number of clusters per clustering method: "title", "lowercase", "nysiis",
+  "sandcrawler", a few more - contrastive comparison of these cluster, e.g. how
+many more matches/non-matches we get for the various methods
+* take N docs from non-clusters and run verify; we would want 100% different/ambiguous results
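Reviewer note: the precision/recall measurement described in the added notes could be computed roughly as follows. This is a minimal sketch, not fuzzycat code. It assumes a hypothetical test set of record pairs, each carrying an expected label (curated or synthetic) and a predicted label (whatever the matcher returned); the function name `precision_recall` and the sample data are made up for illustration.

```python
from typing import Iterable, Tuple


def precision_recall(pairs: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (precision, recall) for (expected, predicted) match labels.

    Hypothetical helper: `expected` is the ground-truth label of a record
    pair, `predicted` is the label assigned by the matcher under test.
    """
    tp = fp = fn = 0
    for expected, predicted in pairs:
        if predicted and expected:
            tp += 1  # matcher said "match" and it is one
        elif predicted and not expected:
            fp += 1  # matcher said "match" but it is not
        elif expected and not predicted:
            fn += 1  # matcher missed a true match
    # Guard against empty denominators (no predictions / no true matches).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels for five record pairs: (expected, predicted).
labeled = [
    (True, True),
    (True, False),
    (False, True),
    (True, True),
    (False, False),
]
precision, recall = precision_recall(labeled)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The same function would work for any of the three test dataset options listed in the notes, since each of them yields pairs with a known expected label.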