From 41f5b4488933e767f3d105fd9a05b557d7152a62 Mon Sep 17 00:00:00 2001
From: Bryan Newbold
Date: Tue, 10 Nov 2020 18:52:37 -0800
Subject: fuzzy matching notes

---
 notes/fuzzy_match_notes.md | 148 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)
 create mode 100644 notes/fuzzy_match_notes.md

diff --git a/notes/fuzzy_match_notes.md b/notes/fuzzy_match_notes.md
new file mode 100644
index 0000000..a87364c
--- /dev/null
+++ b/notes/fuzzy_match_notes.md
@@ -0,0 +1,148 @@
+
+These are notes on how bibliographic metadata matches (of records) and
+slugification (to create lookup keys on title strings) worked in the past in
+the sandcrawler repository. Eg, circa 2018.
+
+## Scala Slug-ification
+
+Original title strings longer than 1023 characters were rejected (before
+slug-ifying).
+
+There was a "slug-denylist". Additionally, scorable strings needed to be
+between 8 and 1023 characters (not bytes) long (inclusive).
+
+Slugification transform was:
+
+- lower-case
+- remove whitespace ("\s")
+- strip specific accent characters:
+    '\u0141' -> 'L',
+    '\u0142' -> 'l', // Letter ell
+    '\u00d8' -> 'O',
+    '\u00f8' -> 'o'
+- remove all '\p{InCombiningDiacriticalMarks}'
+- remove punctuation:
+    \p{Punct}
+    ’·“”‘’“”«»「」¿–±§
+
+Partially adapted from apache commons:
+
+My original notes/proposal:
+
+1. keep only \p{Ideographic}, \p{Alphabetic}, and \p{Digit}
+2. strip accents
+3. "lower-case" (unicode-aware)
+4. do any final custom/manual mappings
+
+Resulting slugs less than 8 characters long were rejected, and slugs were
+checked against a denylist.
+
+Only 554 entries in the denylist; could just ship that in the library.
+
+
+## Python Tokenization
+
+- "&apos;" -> "'"
+- remove non "isalnum()" characters
+- encode as ASCII; this removes diacritics etc, but also all non-latin character sets
+- optionally remove all whitespace
+
+
+## Python GROBID Cleanups
+
+These are likely pretty GROBID-specific. Article title was required, but any of
+the other filtered-out fields just resulted in partial metadata. These filters
+are the result of lots of manual verification of results, and doing things like
+truncating titles and looking at the most popular prefixes for a large
+random sample.
+
+Same denylist for title slugs as Scala, plus:
+
+    editorial
+    advertisement
+    bookreviews
+    reviews
+    nr
+    abstractoriginalarticle
+    originalarticle
+    impactfactor
+    articlenumber
+
+Other filters on title strings (any of these bad):
+
+- 500 or more characters long
+- tokenized string less than 10 characters
+- tokenized starts with 'nr' or 'issn'
+- lowercase starts with 'int j' or '.int j'
+- contains both "volume" and "issue"
+- contains "downloadedfrom"
+- fewer than 2 or more than 50 tokens (words)
+- more than 12 tokens only a single character long
+- more than three ":"; more than one "|"; more than one "."
+
+Remove title prefixes (but allow):
+
+    "Title: "
+    "Original Article: "
+    "Original Article "
+    "Article: "
+
+Denylist for authors:
+
+    phd
+    phdstudent
+
+Journal name processing:
+
+- apply title denylist
+- remove prefixes
+    characters: /~&©
+    Original Research Article
+    Original Article
+    Research Article
+    Available online www.jocpr.com
+- remove suffixes
+    Available online at www.sciarena.com
+    Original Article
+    Available online at
+    ISSN
+    ISSUE
+- remove anywhere
+    e-ISSN
+    p-ISSN
+
+## Python Grouping Comparison
+
+Would consume joined groups, row-by-row. At most 10 matches per group; any more
+and skip (this was for file-to-release).
+
+Overall matching requirements:
+
+- string similarity threshold from scala code
+    https://oldfashionedsoftware.com/2009/11/19/string-distance-and-refactoring-in-scala/
+    https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java/16018452#16018452
+- authors should be consistent
+    - convert one author list into space-separated tokens
+    - remove "jr." from all author token lists
+    - the last word for each author full name in the other list (eg, the lastname),
+      tokenized, must be in the token set
+- if both years defined, then must match exactly (integers)
+
+In the code, there is a note:
+
+    Note: the actual importer/merger should filter the following patterns out:
+    - container title has "letter" and "diar"
+    - contribs (authors) contain "&NA;"
+    - dates differ (not just year)
+
+
+## Scala Metadata Keys
+
+Only the titles were ever actually used (in scala), but the keys allowed were:
+
+- title
+- authors (list of strings)
+- year (int)
+- contentType
+- doi
+
+--
cgit v1.2.3
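
Reviewer sketch (not part of the patch): the "Scala Slug-ification" steps in the notes above could look roughly like this in Python. This is a minimal illustration, not the original Scala code. The function name `slugify`, its parameters, and the explicit NFD normalization step are assumptions; the notes only say that combining diacritical marks were removed, and NFD is assumed here so that precomposed accented letters actually expose those marks. Python's `\s` is also Unicode-aware, which may differ slightly from the Scala/Java regex.

```python
import re
import string
import unicodedata

# Explicit letter mappings listed in the notes (these do not decompose under NFD).
SPECIAL_CHARS = str.maketrans({
    "\u0141": "L",
    "\u0142": "l",  # Letter ell
    "\u00d8": "O",
    "\u00f8": "o",
})

# ASCII punctuation (same 32 characters as Java's POSIX \p{Punct})
# plus the extra characters listed in the notes; all are deleted.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "’·“”‘’“”«»「」¿–±§")

# The Unicode "Combining Diacritical Marks" block.
COMBINING_MARKS = re.compile("[\u0300-\u036f]")


def slugify(title, min_len=8, max_len=1023, denylist=frozenset()):
    """Return a lookup slug for a title, or None if it is rejected.

    Hypothetical helper following the order of steps in the notes.
    """
    # Over-long original titles are rejected before slug-ifying.
    if len(title) > max_len:
        return None
    slug = title.lower()
    slug = re.sub(r"\s", "", slug)
    slug = slug.translate(SPECIAL_CHARS)
    # Assumption: decompose first so accents become separate combining marks.
    slug = unicodedata.normalize("NFD", slug)
    slug = COMBINING_MARKS.sub("", slug)
    slug = slug.translate(PUNCT_TABLE)
    # Short slugs and denylisted slugs are rejected.
    if len(slug) < min_len or slug in denylist:
        return None
    return slug


print(slugify("Łódź: A Study of Cities"))
```

The denylist is passed in as a set rather than hard-coded, since the notes suggest the 554 entries could simply be shipped with a library.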