author Bryan Newbold <bnewbold@robocracy.org> 2021-11-24 15:48:01 -0800
committer Bryan Newbold <bnewbold@robocracy.org> 2021-11-24 15:48:01 -0800
commit d6b1d3de6224b590a82b175f78b761df1a6df4a2
tree cdc5904d0136432fbfa0500fe136897eea650c34 /proposals/2020_fuzzy_matching.md
parent cc0393de91301a29bd469e38519125a530b4472d
codespell fixes to proposals
Diffstat (limited to 'proposals/2020_fuzzy_matching.md')
-rw-r--r-- proposals/2020_fuzzy_matching.md 6
1 files changed, 3 insertions, 3 deletions
diff --git a/proposals/2020_fuzzy_matching.md b/proposals/2020_fuzzy_matching.md
index 30c321e3..e84c2bd2 100644
--- a/proposals/2020_fuzzy_matching.md
+++ b/proposals/2020_fuzzy_matching.md
@@ -244,7 +244,7 @@ use-cases:
Optionally, we could also architect/design this tool to replace biblio-glutton
for ingest-time "reference consolidation", by exposing a biblio-glutton
compatible API. If this isn't possible or hard it could become a later tool
-instead. Eg, shouldn't sacrafice batch performance for this. In particular, for
+instead. Eg, shouldn't sacrifice batch performance for this. In particular, for
ingest-time reference matching we'd want the backing corpus to be updated
continuously, which might be tricky or in conflict with batch-mode design.
@@ -289,7 +289,7 @@ reading the Scala and Python source
## Longtail OA Import Filtering
-Not direcly related to matching, but filtering mixed-quality metadata.
+Not directly related to matching, but filtering mixed-quality metadata.
As part of Longtail OA preservation work, we ran a crawl of small OA journal
websites, and then ran GROBID over the resulting PDFs to extract metadata. We
@@ -383,7 +383,7 @@ indices. It is also possible to iterate over both indices by bucket and doing
further processing between all the papers, then combined the matches/groups
from both iterations. The reason for using two indices is to be robust against
mangled metadata where there is added junk or missing words at either the
-begining or end of the title.
+beginning or end of the title.
To verify candidate pairs, the Jaccard similarity is calculated between the
full original title strings. This flexibly allows for character typos (human or