Diffstat (limited to 'notes')
-rw-r--r-- | notes/approach.dot | 13
-rw-r--r-- | notes/approach.png | bin | 40516 -> 0 bytes
-rw-r--r-- | notes/matching_metrics.md | 16
3 files changed, 16 insertions, 13 deletions
diff --git a/notes/approach.dot b/notes/approach.dot
deleted file mode 100644
index 0bf3cbb..0000000
--- a/notes/approach.dot
+++ /dev/null
@@ -1,13 +0,0 @@
-digraph f {
- "matching" -> "strings";
- "matching" -> "entities";
-
- "strings" -> "lookups";
- "strings" -> "normalization";
- "strings" -> "fuzzy";
-
- "entities" -> "identifiers";
- "entities" -> "field subsets";
-
- "field subsets" -> "strings";
-}
diff --git a/notes/approach.png b/notes/approach.png
deleted file mode 100644
index cce18d7..0000000
--- a/notes/approach.png
+++ /dev/null
Binary files differ
diff --git a/notes/matching_metrics.md b/notes/matching_metrics.md
new file mode 100644
index 0000000..d37240f
--- /dev/null
+++ b/notes/matching_metrics.md
@@ -0,0 +1,16 @@
+# Matching Metrics
+
+## Precision/Recall
+
+For fuzzy matching we want to understand precision and recall. Options for test datasets:
+
+* manually curated (100s of examples); could determine
+* autogenerate slightly different set of real-world metadata (e.g. crossref vs. doaj) converted to releases
+* automatically distorted set of records; 1 original, plus N distorted (synthetic)
+
+## Overall numbers
+
+* number of clusters per clustering method: "title", "lowercase", "nysiis",
+ "sandcrawler", a few more - contrastive comparison of these cluster, e.g. how
+many more matches/non-matches we get for the various methods
+* take N docs from non-clusters and run verify; we would want 100% different/ambiguous results
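
Note on the Precision/Recall section of notes/matching_metrics.md: one way to score a clustering method against any of the proposed test datasets is pairwise precision and recall. The following is a minimal sketch under that assumption, not code from this repository; `cluster_pairs` and `precision_recall` are hypothetical helper names, and the record ids are made up for illustration.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Expand clusters (iterables of record ids) into a set of unordered id pairs."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def precision_recall(predicted, gold):
    """Pairwise precision and recall of predicted clusters against gold clusters."""
    pred_pairs = cluster_pairs(predicted)
    gold_pairs = cluster_pairs(gold)
    true_pos = len(pred_pairs & gold_pairs)
    # Assumption: an empty pair set scores 1.0 rather than raising an error.
    precision = true_pos / len(pred_pairs) if pred_pairs else 1.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall


# Hypothetical data: one original plus distorted copies share a gold cluster.
gold = [["a1", "a2", "a3"], ["b1", "b2"]]
predicted = [["a1", "a2"], ["a3", "b1", "b2"]]
precision, recall = precision_recall(predicted, gold)
```

With the synthetic option (1 original plus N distorted records), each original and its distortions would form one gold cluster, and every clustering method ("title", "lowercase", "nysiis", ...) could be scored with the same function for a side-by-side comparison.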