notes on matching metrics

author: Martin Czygan <martin.czygan@gmail.com> 2021-07-08 17:48:39 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2021-07-08 17:48:39 +0200
commit: 6a97a067aa967d681c112f7d2ea1e02e038189ee (patch)
tree: 8550273ea9dd0154e2f2b22f1d3f4f3e6f0f752f
parent: 1d84fec0927a98e576a6525911252d334f0da48a (diff)
download: fuzzycat-6a97a067aa967d681c112f7d2ea1e02e038189ee.tar.gz
fuzzycat-6a97a067aa967d681c112f7d2ea1e02e038189ee.zip
1 files changed, 16 insertions, 0 deletions
diff --git a/notes/matching_metrics.md b/notes/matching_metrics.md
new file mode 100644
index 0000000..d37240f
--- /dev/null
+++ b/notes/matching_metrics.md
@@ -0,0 +1,16 @@
+# Matching Metrics
+
+## Precision/Recall
+
+For fuzzy matching we want to understand precision and recall. Options for test datasets:
+
+* manually curated (100s of examples); could determine
+* autogenerated set of slightly different real-world metadata (e.g. crossref vs. doaj), converted to releases
+* automatically distorted set of records; 1 original, plus N distorted (synthetic)
+
+## Overall numbers
+
+* number of clusters per clustering method: "title", "lowercase", "nysiis",
+  "sandcrawler", a few more; contrastive comparison of these clusters, e.g. how
+  many more matches/non-matches we get for the various methods
+* take N docs from non-clusters and run verify; we would want 100% different/ambiguous results
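
A minimal sketch of how precision and recall could be computed for any of the test dataset options in the notes above, assuming both the curated ("gold") grouping and the output of a clustering method are available as lists of record id lists. The function names `cluster_pairs` and `precision_recall` and the sample ids are hypothetical and not part of fuzzycat; the pairwise formulation is one common choice, not necessarily the one the project uses.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters (iterables of record ids) into a set of unordered id pairs."""
    pairs = set()
    for cluster in clusters:
        # Sort so that each unordered pair has exactly one representation.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted clusters against gold clusters."""
    pred_pairs = cluster_pairs(predicted)
    gold_pairs = cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Hypothetical example: two gold groups, and a method that splits one of them.
gold = [["a1", "a2", "a3"], ["b1", "b2"]]
predicted = [["a1", "a2"], ["a3", "b1", "b2"]]
p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Comparing pairs rather than whole clusters gives partial credit to a method that gets most of a group right, which also makes the per-method numbers ("title", "lowercase", "nysiis", "sandcrawler") directly comparable.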