# Matching Metrics

## Precision/Recall

For fuzzy matching we want to understand precision and recall. Options for test datasets:

* a manually curated set (100s of examples); from this we could determine precision and recall directly
* an auto-generated set built from slightly different real-world metadata for the same works (e.g. crossref vs. doaj), converted to releases
* an automatically distorted set of records: 1 original plus N distorted copies (synthetic); see the evaluation sketch after this list
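
A minimal sketch of the synthetic-distortion approach and the precision/recall calculation. The names here are assumptions, not the project's actual code: `distort` is a placeholder corruption model (one random character edit) and `match` stands for whatever fuzzy matcher is under test, taking two records and returning True/False.

```python
import random


def distort(title: str, rng: random.Random) -> str:
    """Return a slightly corrupted copy of a title (one random character edit)."""
    if len(title) < 2:
        return title + "x"
    chars = list(title)
    op = rng.choice(["swap", "drop", "dup"])
    i = rng.randrange(len(chars) - 1)
    if op == "swap":
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    elif op == "drop":
        del chars[i]
    else:  # "dup": duplicate one character
        chars.insert(i, chars[i])
    return "".join(chars)


def precision_recall(pairs, match):
    """pairs: iterable of (a, b, is_same) with ground-truth labels;
    match: callable(a, b) -> bool, the matcher under test."""
    tp = fp = fn = 0
    for a, b, is_same in pairs:
        predicted = match(a, b)
        if predicted and is_same:
            tp += 1
        elif predicted and not is_same:
            fp += 1
        elif not predicted and is_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Labeled pairs could come from any of the three dataset options above: curated pairs, crossref/doaj pairs for the same work (positives), or (original, distorted) pairs as synthetic positives with random cross-pairs as negatives.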

## Overall numbers

* number of clusters per clustering method: "title", "lowercase", "nysiis",
  "sandcrawler", and a few more; a contrastive comparison of these clusterings,
  e.g. how many more matches/non-matches we get for each method (see the
  cluster-count sketch below)
* take N docs that fell into no cluster and run verification on pairs of them; ideally 100% of the results would be "different" or "ambiguous" (see the spot-check sketch at the end)
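
A minimal sketch of the per-method cluster counts. The key functions are assumptions, not the project's actual implementations: NYSIIS comes from the jellyfish library, and the "sandcrawler" normalization is project-specific, so it is omitted here.

```python
import collections

try:
    import jellyfish  # optional; only needed for the nysiis key
except ImportError:
    jellyfish = None


def key_title(release):
    return release.get("title", "")


def key_lowercase(release):
    return release.get("title", "").lower().strip()


def key_nysiis(release):
    title = release.get("title", "")
    return jellyfish.nysiis(title) if jellyfish else title


KEY_FUNCS = {"title": key_title, "lowercase": key_lowercase, "nysiis": key_nysiis}


def cluster_counts(releases):
    """Return {method: number of clusters with more than one member}."""
    counts = {}
    for name, keyfunc in KEY_FUNCS.items():
        clusters = collections.defaultdict(list)
        for release in releases:
            clusters[keyfunc(release)].append(release)
        counts[name] = sum(1 for members in clusters.values() if len(members) > 1)
    return counts
```

Comparing these counts across methods shows how many additional matches (and potential false merges) each looser key produces relative to exact title matching.

A minimal sketch of the spot check on unclustered documents. Here `verify` is a hypothetical pairwise verifier returning a status string such as "exact", "strong", "different", or "ambiguous"; the real interface may differ.

```python
import itertools
import random


def spot_check_non_clusters(unclustered, verify, sample_size=100, seed=0):
    """Sample docs that fell into no cluster and verify all pairs among them;
    ideally every pair comes back "different" or "ambiguous"."""
    rng = random.Random(seed)
    sample = rng.sample(unclustered, min(sample_size, len(unclustered)))
    unexpected = []
    for a, b in itertools.combinations(sample, 2):
        status = verify(a, b)
        if status not in ("different", "ambiguous"):
            unexpected.append((a, b, status))
    # an empty list means the 100% expectation holds for this sample
    return unexpected
```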