author Bryan Newbold <bnewbold@robocracy.org> 2020-12-16 19:56:01 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2020-12-16 20:16:09 -0800
commit 6d5811693c36b9e73dedf0205c40f2aed63e2870
tree 717de06d66ac009205a91cdeb511d113d61eac85 /python/fatcat_tools
parent 38328c25674fee7781a8d8601e1d47de04186f19
add fuzzy match filtering to DOAJ importer
In the default configuration, any entity with a fuzzy match (even an "ambiguous" one) is skipped at import time to prevent creating duplicates. This errs on the side of not creating new or duplicate entities. In the future, as we gain confidence in fuzzy matching and verification, we can start to ignore AMBIGUOUS matches, handle EXACT matches as the same release, and merge STRONG (and WEAK?) matches under the same work entity.
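The policy described above, and one possible reading of the future plan, can be sketched as a small decision table. This is only an illustration: the `Status` enum, both function names, and the returned action strings are hypothetical stand-ins, not the importer's or the fuzzy-match library's actual types.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    # Hypothetical stand-in for the verification statuses named in the
    # commit message; not the real library type.
    EXACT = "exact"
    STRONG = "strong"
    WEAK = "weak"
    AMBIGUOUS = "ambiguous"


def current_policy(status: Optional[Status]) -> str:
    """Default behavior in this commit: any fuzzy match skips the import."""
    return "create" if status is None else "skip"


def possible_future_policy(status: Optional[Status]) -> str:
    """One reading of the future plan; the commit leaves WEAK as an open question."""
    if status is None or status is Status.AMBIGUOUS:
        return "create"        # AMBIGUOUS would be ignored
    if status is Status.EXACT:
        return "same-release"  # treat as the already-existing release
    return "same-work"         # STRONG (and perhaps WEAK): group under one work
```

Under the current policy even `Status.AMBIGUOUS` maps to `"skip"`, which is what makes the default conservative.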
Diffstat (limited to 'python/fatcat_tools')
-rw-r--r-- python/fatcat_tools/importers/doaj_article.py 11
1 files changed, 9 insertions, 2 deletions
diff --git a/python/fatcat_tools/importers/doaj_article.py b/python/fatcat_tools/importers/doaj_article.py
index 03752484..191a65d8 100644
--- a/python/fatcat_tools/importers/doaj_article.py
+++ b/python/fatcat_tools/importers/doaj_article.py
@@ -217,9 +217,16 @@ class DoajArticleImporter(EntityImporter):
                         return False
                     break

-        # TODO: in the future could do fuzzy match here, eg using elasticsearch
+        if not existing and self.do_fuzzy_match:
+            fuzzy_result = self.match_existing_release_fuzzy(re)
+            # TODO: in the future, could assign work_id for clustering, or for
+            # "EXACT" match, set existing and allow (optional) update code path
+            # to run
+            if fuzzy_result is not None:
+                self.counts["exists-fuzzy"] += 1
+                return False

-        # create entity
+        # if no fuzzy existing match, create entity
         if not existing:
             return True
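
For readers who want to try the control flow outside the importer, here is a minimal standalone sketch of the skip logic in the hunk above. `FuzzySkipSketch` and its `matcher` callable are hypothetical; the real method lives on `DoajArticleImporter` and calls `self.match_existing_release_fuzzy(re)`, whose return type is not shown in this diff. The final `return False` is a simplification, since the real importer's handling of an existing release is outside this hunk.

```python
from collections import Counter
from typing import Callable, Optional


class FuzzySkipSketch:
    """Hypothetical miniature of the importer's skip logic; not the real class."""

    def __init__(self, matcher: Callable[[dict], Optional[str]],
                 do_fuzzy_match: bool = True):
        self.matcher = matcher              # stand-in for the fuzzy lookup
        self.do_fuzzy_match = do_fuzzy_match
        self.counts = Counter()

    def try_update(self, release: dict, existing: Optional[dict] = None) -> bool:
        # Only consult the fuzzy matcher when no identifier lookup matched.
        if not existing and self.do_fuzzy_match:
            if self.matcher(release) is not None:
                self.counts["exists-fuzzy"] += 1
                return False                # skip: possible duplicate
        # If there is no existing match of any kind, create the entity.
        if not existing:
            return True
        return False                        # simplified: update path omitted


importer = FuzzySkipSketch(lambda r: "AMBIGUOUS" if r["title"] == "dup" else None)
skipped = importer.try_update({"title": "dup"})
created = importer.try_update({"title": "new"})
```

With fuzzy matching switched off (`do_fuzzy_match=False`), the matcher is never consulted and unmatched releases are always created.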