notes/fuzzy_match_notes.md




These are notes on how bibliographic metadata matching (of records) and
slugification (creating lookup keys from title strings) worked in the past in
the sandcrawler repository, e.g., circa 2018.

## Scala Slug-ification

Original title strings longer than 1023 characters were rejected (before
slug-ifying).

There was a "slug-denylist". Additionally, scorable strings needed to be
between 8 and 1023 characters (not bytes) long (inclusive).

Slugification transform was:

- lower-case
- remove whitespace ("\s")
- strip specific accent characters:
    '\u0141' -> 'L',
    '\u0142' -> 'l',  // Letter ell
    '\u00d8' -> 'O',
    '\u00f8' -> 'o'
- remove all '\p{InCombiningDiacriticalMarks}'
- remove punctuation:
    \p{Punct}
    ’·“”‘’“”«»「」¿–±§

Partially adapted from apache commons: <https://git-wip-us.apache.org/repos/asf?p=commons-lang.git;a=blob;f=src/main/java/org/apache/commons/lang3/StringUtils.java;h=1d7b9b99335865a88c509339f700ce71ce2c71f2;hb=HEAD#l934>
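
The transform above can be approximated in Python as follows. This is a sketch, not the original Scala code. It assumes `\p{Punct}` meant ASCII punctuation only (Java's default without Unicode character classes) and that strings were NFD-decomposed before combining marks were removed, as the Apache Commons accent-stripping approach does. The names `slugify`, `MANUAL_MAP`, and `EXTRA_PUNCT` are hypothetical.

```python
import re
import string
import unicodedata

# Characters that do not decompose under NFD, so they are mapped by hand.
MANUAL_MAP = str.maketrans({
    "\u0141": "L",
    "\u0142": "l",
    "\u00d8": "O",
    "\u00f8": "o",
})

# Extra punctuation from the list above (duplicates dropped), added to
# ASCII punctuation. Assumption: \p{Punct} covered ASCII only.
EXTRA_PUNCT = "’·“”‘«»「」¿–±§"
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + EXTRA_PUNCT) + "]")


def slugify(title: str) -> str:
    s = title.lower()
    s = re.sub(r"\s", "", s)
    s = s.translate(MANUAL_MAP)
    # Assumption: decompose first, then drop the combining marks
    # in U+0300..U+036F (the InCombiningDiacriticalMarks block).
    s = unicodedata.normalize("NFD", s)
    s = re.sub("[\u0300-\u036f]", "", s)
    return PUNCT_RE.sub("", s)
```

Because lower-casing runs first, only the lower-case entries of the manual mapping can fire in this ordering; the upper-case ones are kept to mirror the list.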

My original notes/proposal:

1. keep only \p{Ideographic}, \p{Alphabetic}, and \p{Digit}
2. strip accents
3. "lower-case" (unicode-aware)
4. do any final custom/manual mappings
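
A rough Python sketch of the four proposed steps is below. It is not code from the repository. `str.isalnum()` stands in for the three Unicode property classes and is only an approximation (it also accepts some numeric characters such as superscripts), and the manual mappings reuse two characters from the earlier list purely for illustration. The names `proposed_slug` and `FINAL_MAP` are hypothetical.

```python
import unicodedata

# Hypothetical "final custom/manual mappings" (step 4).
FINAL_MAP = str.maketrans({"\u0142": "l", "\u00f8": "o"})


def proposed_slug(title: str) -> str:
    # 1. keep only alphabetic, ideographic, and digit characters
    #    (approximated with str.isalnum())
    kept = "".join(c for c in title if c.isalnum())
    # 2. strip accents: decompose, then drop combining characters
    decomposed = unicodedata.normalize("NFD", kept)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # 3. Unicode-aware lower-casing
    lowered = stripped.lower()
    # 4. final manual mappings
    return lowered.translate(FINAL_MAP)
```

Unlike the ASCII-encoding approach, this keeps non-Latin scripts in the resulting key.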

Resulting slugs less than 8 characters long were rejected, and slugs were
checked against a denylist.

Only 554 entries in the denylist; could just ship that in the library.
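
The accept/reject logic around slugification could look roughly like the sketch below. The function name `title_to_key` and its parameters are hypothetical; the slug function and denylist are passed in rather than assumed. Python's `len()` counts code points, which may differ from how the Scala code counted characters for text outside the Basic Multilingual Plane.

```python
from typing import AbstractSet, Callable, Optional

MAX_TITLE_CHARS = 1023  # longer original titles are rejected
MIN_SLUG_CHARS = 8      # shorter slugs are rejected


def title_to_key(
    title: str,
    slugify: Callable[[str], str],
    denylist: AbstractSet[str],
) -> Optional[str]:
    """Return a lookup key for the title, or None if it is rejected."""
    if len(title) > MAX_TITLE_CHARS:
        return None
    slug = slugify(title)
    if len(slug) < MIN_SLUG_CHARS:
        return None
    if slug in denylist:
        return None
    return slug
```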


## Python Tokenization

- "&apos;" -> "'"
- remove non "isalnum()" characters
- encode as ASCII; this removes diacritics etc., but also all non-Latin character sets
- optionally remove all whitespace
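
A minimal sketch of these steps follows. The function name and signature are hypothetical, and the ASCII error mode (`"ignore"`) is an assumption: with it, non-ASCII characters are dropped outright rather than transliterated. Lower-casing is not listed in the steps above, so it is left out here.

```python
def tokenize(s: str, remove_whitespace: bool = False) -> str:
    # "&apos;" -> "'"
    s = s.replace("&apos;", "'")
    # remove non-alphanumeric characters; whitespace survives
    # unless the caller asks for it to be removed as well
    if remove_whitespace:
        s = "".join(c for c in s if c.isalnum())
    else:
        s = "".join(c for c in s if c.isalnum() or c.isspace())
    # Assumption: drop anything that cannot be encoded as ASCII.
    return s.encode("ascii", "ignore").decode("ascii")
```

Note that the apostrophe restored in the first step is removed again by the alphanumeric filter.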


## Python GROBID Cleanups

These are likely pretty GROBID-specific. An article title was required, but any
of the other filtered-out fields just resulted in partial metadata. These
filters are the result of lots of manual verification of results, and of things
like truncating titles and looking at the most popular prefixes in a large
random sample.

Same denylist for title slugs as Scala, plus:

    editorial
    advertisement
    bookreviews
    reviews
    nr
    abstractoriginalarticle
    originalarticle
    impactfactor
    articlenumber

Other filters on title strings (a title matching any of these is rejected):

- 500 or more characters long
- tokenized string less than 10 characters
- tokenized starts with 'nr' or 'issn'
- lowercase starts with 'int j' or '.int j'
- contains both "volume" and "issue"
- contains "downloadedfrom"
- fewer than 2 or more than 50 tokens (words)
- more than 12 tokens that are only a single character long
- more than three ":"; more than one "|"; more than one "."
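
The filters above could be combined into a single predicate, sketched below. The name `is_bad_title` is hypothetical, and the caller is assumed to supply the tokenized form of the title (alphanumeric only, no whitespace). Which string each check ran against is not recorded in these notes, so the choices in the comments are assumptions.

```python
def is_bad_title(title: str, tokenized: str) -> bool:
    tok = tokenized.lower()
    lowered = title.lower()
    words = title.split()
    single_char = sum(1 for w in words if len(w) == 1)
    checks = [
        len(title) >= 500,
        len(tok) < 10,
        tok.startswith(("nr", "issn")),
        lowered.startswith(("int j", ".int j")),
        # Assumption: checked against the lower-cased raw title.
        "volume" in lowered and "issue" in lowered,
        # No whitespace in the pattern, so check the tokenized form.
        "downloadedfrom" in tok,
        len(words) < 2 or len(words) > 50,
        single_char > 12,
        title.count(":") > 3,
        title.count("|") > 1,
        title.count(".") > 1,
    ]
    return any(checks)
```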

Remove these title prefixes (but otherwise allow the title):

    "Title: "
    "Original Article: "
    "Original Article "
    "Article: "
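
Prefix removal can be sketched as below. The helper name is hypothetical, and it assumes case-sensitive matching with at most one prefix stripped per title.

```python
TITLE_PREFIXES = (
    "Title: ",
    "Original Article: ",
    "Original Article ",
    "Article: ",
)


def strip_title_prefix(title: str) -> str:
    # Strip the first matching prefix, if any, and keep the rest.
    for prefix in TITLE_PREFIXES:
        if title.startswith(prefix):
            return title[len(prefix):]
    return title
```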

Denylist for authors:

    phd
    phdstudent

Journal name processing:

- apply title denylist
- remove prefixes
    characters: /~&©
    Original Research Article
    Original Article
    Research Article
    Available online www.jocpr.com
- remove suffixes
    Available online at www.sciarena.com
    Original Article
    Available online at
    ISSN
    ISSUE
- remove anywhere
    e-ISSN
    p-ISSN
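
One way to sketch the journal name processing is shown below, applying the steps in the order listed. The function name, the way the denylist key is derived (lower-cased, alphanumeric only), and the whitespace cleanup are assumptions rather than details from the original code.

```python
from typing import AbstractSet, Optional

PREFIX_CHARS = "/~&©"
PREFIXES = (
    "Original Research Article",
    "Original Article",
    "Research Article",
    "Available online www.jocpr.com",
)
SUFFIXES = (
    "Available online at www.sciarena.com",
    "Original Article",
    "Available online at",
    "ISSN",
    "ISSUE",
)
ANYWHERE = ("e-ISSN", "p-ISSN")


def clean_journal_name(
    name: str, denylist: AbstractSet[str] = frozenset()
) -> Optional[str]:
    # Assumption: the denylist holds lower-case alphanumeric keys.
    key = "".join(c for c in name.lower() if c.isalnum())
    if key in denylist:
        return None
    name = name.strip().lstrip(PREFIX_CHARS).strip()
    for prefix in PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):].strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)].strip()
    for fragment in ANYWHERE:
        name = name.replace(fragment, "")
    # Collapse whitespace left behind by the removals.
    name = " ".join(name.split())
    return name or None
```

With this ordering, a trailing "e-ISSN" is caught by the "ISSN" suffix rule first and leaves "e-" behind; running the "remove anywhere" step first would avoid that.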

## Python Grouping Comparison

The comparison would consume joined groups, row by row. At most 10 matches were
allowed per group; groups with more were skipped (this was for file-to-release
matching).
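
As an illustration of the group-size cap, the sketch below assumes the joined rows arrive sorted by key, with the key as the first element of each row. The function name and row layout are hypothetical, and "matches" is read here as rows per group.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

MAX_GROUP_SIZE = 10


def iter_groups(
    rows: Iterable[Tuple[str, object]], max_size: int = MAX_GROUP_SIZE
) -> Iterator[Tuple[str, List[Tuple[str, object]]]]:
    # groupby only merges adjacent rows, hence the sorted-input assumption.
    for key, group in groupby(rows, key=lambda row: row[0]):
        members = list(group)
        if len(members) > max_size:
            continue  # too many candidates for this key: skip the group
        yield key, members
```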

Overall matching requirements:

- string similarity threshold from Scala code
    https://oldfashionedsoftware.com/2009/11/19/string-distance-and-refactoring-in-scala/
    https://stackoverflow.com/questions/955110/similarity-string-comparison-in-java/16018452#16018452
- authors should be consistent
    - convert one author list into space-separated tokens
    - remove "jr." from all author token lists
    - the last word for each author full name in the other list (eg, the lastname),
      tokenized, must be in the token set
- if both years defined, then must match exactly (integers)
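
The author and year checks can be sketched as follows. The function names and the per-word tokenizer (lower-case, alphanumeric only) are assumptions, as is the choice of which list becomes the token set. The string similarity threshold is not recorded in these notes and is left out.

```python
from typing import List, Optional


def _tok(word: str) -> str:
    # Hypothetical per-word tokenizer: lower-case, alphanumeric only.
    return "".join(c for c in word.lower() if c.isalnum())


def authors_consistent(left: List[str], right: List[str]) -> bool:
    # One author list becomes a flat set of tokens, minus "jr".
    left_tokens = {_tok(w) for name in left for w in name.split()}
    left_tokens.discard("jr")
    left_tokens.discard("")
    for name in right:
        words = [w for w in name.split() if _tok(w) not in ("jr", "")]
        if not words:
            continue
        # The last word of each full name (e.g. the last name) must appear.
        if _tok(words[-1]) not in left_tokens:
            return False
    return True


def years_consistent(a: Optional[int], b: Optional[int]) -> bool:
    # Only compared when both years are defined.
    if a is not None and b is not None:
        return a == b
    return True
```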

In the code, there is a note:

    Note: the actual importer/merger should filter the following patterns out:
    - container title has "letter" and "diar"
    - contribs (authors) contain "&NA;"
    - dates differ (not just year)


## Scala Metadata Keys

Only the titles were ever actually used (in Scala), but the keys allowed were:

- title
- authors (list of strings)
- year (int)
- contentType
- doi
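
For reference, the allowed keys could be modeled as a small record type. The class name, the optionality of each field, and the string types for `contentType` and `doi` are assumptions; only the key names and the types noted in the list above come from these notes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MatchKeys:
    title: str
    authors: List[str] = field(default_factory=list)
    year: Optional[int] = None
    contentType: Optional[str] = None  # assumed to be a string
    doi: Optional[str] = None          # assumed to be a string
```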