aboutsummaryrefslogtreecommitdiffstats
path: root/proposals/2020_fuzzy_matching.md
blob: e84c2bd28eab9c47f7dc496b067848df6d8587e9 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430

Status: planned

Bibliographic Entity Fuzzy Match and Verification
====================================================

This document summarizes work so far on entity "fuzzy matching" and match
verification, notes some specific upcoming use cases for such methods, and
proposes new generic routines to be implemented.

There are three main routines desired:

**Match verification:** given two bibliographic metadata records (or two entire
entities), confirm whether they refer to the exact same version of the entity.
Optionally, also determine if they are not the exact same entity, but are
variations of the same entity (eg, releases that should be grouped under a
shared work entity). Performance is not as critical as correctness. Should be a
pure, in-memory routine (not touch the network or external services).

**Fuzzy matching:** of bibliographic metadata to entity in current/live Fatcat
catalog. Eg, given complete or partial metadata about a paper or other entity,
return a list of candidate matches. Emphasis on recall over precision; better
to return false matches than missed records. This would likely hit the fatcat
elasticsearch indexes (search.fatcat.wiki), which are continuously updated to
be in sync with the catalog itself (API). Expected to scale to, eg, 5-10
lookups per second and operate over up to a couple million entities during
cleanup or merging operations. Should operate per record, on-demand, always
up-to-date (aka, not a batch process).

**Bulk fuzzy match:** a tool capable of matching hundreds or billions of raw
metadata records against the entire fatcat catalog, resulting in candidate
fuzzy match pairs or groupings. The two main applications are de-duplicating or
grouping releases over the entire catalog (aka, matching the catalog against
itself), matching billions of citations (structured partial metadata) against
release entities, and grouping citations which fail to match to existing
entities against each other. Likely to include a batch mode, though may also
include an efficient per-record lookup API as well.

As a terminology note, the outputs of a fuzzy match are called "candidate
matches", and the output of match verificatino would be "confirmed match" or
"confident match". A "self-match" is when an exact complete entity is compared
to a set it already exists in, and matches to itself (aka, fatcat entity
identifier is exact match). "Identifier matches" are an easy case when external
identifiers (like DOI, ISSN, ORCID) are used for matching; note that the Fatcat
lookup API always returns the first such match, when there may actually be
multiple entities with the same external identifier, so this may not always be
sufficient.


## Container Matching

Containers are a simpler case, so will discuss those first. Function signatures
would look like:

    verify_container_match(left:container, right:container) -> status:str
    match_container_fuzzy(record:??) -> [candidate:container]

Verify statuses could be:

- `exact`: all metadata fields match byte-for-byte
- `strong`: good confidence these are the same
- `weak`: could be a match, perhaps missing enough metadata to be confident
- `ambiguous`: not enough confidence that is or even isn't a match
- `not-match`: confident that these are *not* the same container (TODO: better word?)

Alternatively, could return a confidence floating point number, but that puts
burdern of interpretation on external code.

Fields of interest would be at least:

- name
- original name
- aliases
- publisher
- abbrev

Some test cases that match verification should probably handle (just making
these up):

- differences in title whitespace and capitalization
- title of one record has ISSN appended ("PLOS One" matches "PLOS One
  1932-6203")
- title of one record has publisher appended ("PLOS One" matches "PLOS One
  (Public Library of Science)")
- record with original (non-English) name in name field matches fatcat record
  where the non-Enlish title is in `original_title` field
- difference of "The" doesn't matter ("Lancet" matches "The Lancet")
- detect and reject bogus names

Nice to haves:

- records with only abbreviation or acroynm match (for "official" abbreviations
  and acronyms, which may need to be recorded as alias).
- records of full name match when the "official" title is now an acroynum (eg,
  "British Medical Journal" matches "The BMJ")
- optionally, detect and return `ambiguous` for a fixed list of known-ambiguous
  names (eg, two journals with very similar titles, name or acronym along can't
  distinguish)

Can't remember if we already have aliases stored in container entity `extra`
yet, but idea is to store "also known as" names to help with fuzzy matching.

The main application of these routines would be adding container linkage to
releases where we have a container name but no identifier (eg, an ISSN). We
have millions of releases (many with DOIs) that have a container name but no
linked container. We also want to import millions of papers based on GROBID
metadata alone, where we will likely only extract a journal name, and will want
to do lookups on that.

For the particular case of containers, we'll probably want to memoize results
(aka, cache lookups) for efficiency. Probably will also want a unified helper
function like:

    match_container(name:str, **kwargs) -> (status:str, match:Option<container>)

which would handle caching (maybe this is a class/object not a function?), call
`match_container_fuzzy()` to get candidates, call `verify_container_match()` on
each candidate, and ensure there is only one `exact` or `strong` match. Eg, if
there are multiple `weak` matches, should return `ambiguous` for the group
overall.

Another speculative application of these routines could be as part of chocula,
to try and find "true" ISSNs for records where we know the ISSN is not
registered or invalid (eg, ISSN checksum digit fails). Another would be to look
for duplicate container/ISSNs, where there are multiple ISSNs for the same
journal but only one has any papers actually published; in that case we would
maybe want to mark the other container entity as a "stub".

Test datasets:

- in kibana/ES, filter for "`container_id` does not exist" and
  "`container_name` exists"
- in sandcrawler, dump `grobid` table and look in the metadata snippet, which
  includes extracted journal info


## Release Matching (verification and fuzzy, not bulk fuzzy)

For release entities, the signature probably looks something like:

    verify_release_match(left:release, right:release) -> status:str
        how confident, based on metadata, that these are same release, or
        should be grouped under same work?
    exact_release_match(left:release, right:release) -> bool
    exact_work_match(left:release, right:release) -> bool

Instead of full release entities, we could use partial metadata in the form of
a python dataclass or named tuple. If we use a struct, will need a routine to
convert full entities to the partial struct; if we use full entities, will
probably want a helper to construct entities from partial metadata.

Don't know which fields would be needed, but guessing at least the following,
in priority order:

- if full entity: `ident` (and `work_id` for case of releases)
- `title`
- `authors` (at least list of surnames, possibly also raw and given names)
- `ext_ids`
- `release_year` (and full `release_date` if available)
- volume, issue, pages
- `subtitle`
- `original_title`
- container name (and `container_id` if available)
- `release_type` and `release_stage` (particularly for work/release grouping
  vs. match)

For releases, the "status" output could be:

- `strong` (good confidence)
- `weak` (seems good, but maybe not enough metadata to be sure)
- `work-strong` (good confidence that same work, but not same release)
- `work-weak`
- `ambiguous`
- `not-match` (name?)

Cases we should handle:

- ability to filter out self-matches from fuzzy matcher. Eg, if we want to find
  a duplicate of an existing release, should be able to filter out self-match
  (as a kwarg?)
- ability to filter out entire categories of matches. eg, match only to
  published works (`release_stage`), only to papers (`release_type`), only not
  match to longtail OA papers (`is_longtail_oa` ES flag)
- have an in-memory stoplist for ambiguous titles (eg, "abstract", "index")
- differences in punctuation, capitalization
- single-character typos
- subtitle/title matching (eg: {title: "Important Research", subtitle: "A
  History"} should match {title: "Important Research: A History", subtitle:
  None})
- record with only one author, surname only should be able to match against
  full author list. maybe a flag to indicate this case? eg, for matching some
  citation styles where only one author is listed
- years are +/- 1. certainly for pre-prints ("work matches"), but also
  different databases sometimes have different semantics about whether
  "publication date" or "submission date" or "indexed date" are used as the
  `release_date` (which we consider the "publication date"). Maybe "weak" match
  in this case
- (probably a lot more i'm not thinking of)

Nice to have (but harder?):

- journal/container matching where one side has an abbreviation, acronym, or
  alias of the other. does this require a network fetch? maybe cached or
  pre-loaded? may not be important, do testing first. Note that entire
  container entity is transcluded when doing full release entity API lookups.
  This may be important for citation matching/verification

Will probably want a helper function to check that a metadata record (or
entity) has enough metadata and seems in-scope for matching. For example, a
record with title of just "Abstract" should probably be blocked from any match
attempt, because there are so many records with that metadata. On the other
hand, if there is an external identifier (eg, DOI), could still attempt a
direct match. Maybe something like:

    can_verify_match(release:release) -> boolean

Test datasets:

- PDFs crawled as part of longtail OA crawls (see context below). We have both
  fatcat release already imported, and new GROBID-extracted works. Can use
  glutton fuzzy matching for comparison, or verify those glutton matches
- many, many unmatched reference strings. will do separate

Optionally, we could implement a glutton-compatible API with equivalent or
better performance for doing GROBID "header consolidation". If performance is
great, could even use it for reference conslidation at crawl/ingest time.


## Bulk Fuzzy

Current concept for this is to implement Semantic Scholar's algorithm
(described below) in any of python, golang, or rust, to function for all these
use-cases:

- grouping releases in fatcat which are variants of the same work (or, eg,
  publisher registered multiple DOIs for same paper by accident) into the same
  work
- reference matching from structured partial metadata to fatcat releases
- grouping of unmatched references (from reference matching) as structured
  partial metadata, with the goal of finding, eg, the "100,000 most cited
  papers which do not have a fatcat entity", and creating fatcat entities for
  them

Optionally, we could also architect/design this tool to replace biblio-glutton
for ingest-time "reference consolidation", by exposing a biblio-glutton
compatible API. If this isn't possible or hard it could become a later tool
instead. Eg, shouldn't sacrifice batch performance for this. In particular, for
ingest-time reference matching we'd want the backing corpus to be updated
continuously, which might be tricky or in conflict with batch-mode design.

## Existing Fatcat Work

### Hadoop matching pipeline

In Summer 2018, the fatcat project had a wonderful volunteer, Ellen Spertus.
Ellen implemented a batch fuzzy matching pipeline in Scala (using the Scalding
map/reduce framework) that ran on our Hadoop cluster. We used it to "join"
GROBID metadata from PDF files against a Crossref metadata dump.

The Scala job worked by converting input metadata records into simple "bibjson"
metadata subsets (plus keeping the original record identifiers, eg Crossref DOI
or PDF file hash, for later import). It created a key for each record by
normalizing the title (removing all whitespace and non-alphanumeric characters,
lower-casing, etc; we called this a "slug"), and filtering out keys from a
blocklist. We used the map/reduce framework to then join the two tables on
these keys, and then filtered the output pairs with a small bit of addiitonal
title similarity comparison logic, then dumped the list of pairs as a table to
Hadoop, sorted by join key (slug). I can't remember if we had any other
heuristics in the Scala code.

This was followed by a second processing stage in python, which iterated over
the full list. It grouped pairs by key, and discarded any groups that had too
many pairs (on the assumption that the titles were too generic). It then ran
additional quality checks (much easier/faster to implement in python) on year,
author names (eg, checking that the number of authors matched). The output of
this filtering was then fed into a fatcat importer which matched the PDFs to
releases based on DOI.

We got several million matches using this technique. In the end we really only
ran this pipeline once. Hadoop (and HBase in particular) ended up being
frustrating to work with, as jobs took hours or days to run, and many bugs
would only appear when run against the full dataset. A particular problem that
came up with the join approach was N^2 explosions for generic titles, where the
number of join rows would get very large (millions of rows) for generic titles,
even after we filtered out the top couple hundred most popular join keys.

TODO: should update this section with specific algorithms and parameters used by
reading the Scala and Python source

## Longtail OA Import Filtering

Not directly related to matching, but filtering mixed-quality metadata.

As part of Longtail OA preservation work, we ran a crawl of small OA journal
websites, and then ran GROBID over the resulting PDFs to extract metadata. We
then filtered the output metadata using quality heuristics, then inserted both
new releases (with no external identifiers, just the extracted metadata) and
the associated file.

The metadata filtering pipeline is interesting because of all the bad metadata
it detected. Eg, long titles, used the normalization and blocklist from the
Hadoop matching work, poor author metadata, etc.

A big problem that was only noticed after this import was that actually many of
the papers imported were duplicates of existing fatcat papers with DOIs or
other identifiers. This was because the crawl ended up spidering some general
purpose repositories, which contained copies of large-publisher OA papers (eg,
PLOS papers). The solution to this will use fuzzy matching in two ways. First,
for future imports of this type, we will fuzzy matches against the catalog to
check that there isn't already a metadata record; possibly link the file to
matched existing entities (based on confidence), but certainly don't create new
records unless sure that there isn't an existing one (`no-match`). Also need to
ensure there are not duplicates *within* each import batch, but either running
the import slowly in a single thread (so elasticsearch and the matching system
has time to synchronize), or doing a batch fuzzy match first. Or possibly some
other pre-filtering idea. Secondly, we can go back over these longtail OA works
(which are tagged as such in the fatcat catalog) and attempt to match them
against non-longtail fatcat releases (eg, those with existing PMID or DOI), and
merge the releases together (note: redirecting the longtail release to the full
one, resulting in a single release, not just doing work grouping of releases).

A separate problem from this import is that none of the papers have container
linkage (though many or all have container names successfully extracted). After
doing release-level merging, we should use container fuzzy matching to update
container linkage for these entities. We could potentially do this as a two
stage project where we dump all the container name strings, prioritize them by
release count, and iterate on container matching until we are happy with the
results, then run the actual release updates.

### biblio-glutton

[biblio-glutton][biblio-glutton] is a companion tool for GROBID to do record
matching and metadata enrichment of both "header" metadata (aka, extracted
metadata about the fulltext paper in the PDF itself) and references (extracted
from the reference/bibliography section of the fulltext paper). GROBID calls
this "consolidation".

biblio-glutton supports a couple different metadata index sources, including
Crossref dumps. IA has patched both GROBID and biblio-glutton to work with
fatcat metadata directly, and to embed fatcat release identifiers in output
TEI-XML when there is a match. We have found performance to be fine for header
consolidation, but too slow for reference consolidation. This is because there
is only one glutton lookup per PDF for header mode, vs 20-50 lookups per PDF for
reference consolidation, which takes longer than the overall PDF extraction. In
our default configuration, we only do header consolidation.

biblio-glutton runs as a REST API server, which can be queried separately from
the GROBID integration. It uses the JVM (can't remember if Java or Scala), and
works by doing an elasticsearch query to find candidates, then selects the best
candidate (if any) and looks up full metadata records from one or more LMDB
key/value databases. It requires it's own elasticsearch index with a custom
schema, and large LMDB files on disk (should be SSD for speed). The process of
updating the LMDB files and elasticsearch index are currently manual (with some
scripts), generated using bulk metadata fatcat dumps. This means the glutton
results get out of sync from the fatcat catalog.

The current update process involves stopping glutton (which means stopping all
GROBID processing) for a couple hours. Compare this to the fatcat search index
(search.fatcat.wiki), which is continuously updated from the changelog feed,
and even during index schema changes has zero (or near zero) downtime.

[biblio-glutton]: https://github.com/kermitt2/biblio-glutton

## Existing External Work and Reading

"The Lens MetaRecord and LensID: An open identifier system for aggregated
metadata and versioning of knowledge artefacts"
<https://osf.io/preprints/lissa/t56yh/>


### Semantic Scholar

Semantic Scholar described their technique for doing bulk reference matching
and entity merging, summarized here:

Fast candidate lookups are done using only a subset of the title. Title strings
are turned in to an array of normalized tokens (words), with stopwords (like
"a", "the") removed. Two separate indices are created: one with key as the the
first three tokens, and the other with the key as the last three tokens. For
each key, there will be a bucket of many papers (or just paper global
identifiers). A per-paper lookup by title will fetch candidates from both
indices. It is also possible to iterate over both indices by bucket and doing
further processing between all the papers, then combined the matches/groups
from both iterations. The reason for using two indices is to be robust against
mangled metadata where there is added junk or missing words at either the
beginning or end of the title.

To verify candidate pairs, the Jaccard similarity is calculated between the
full original title strings. This flexibly allows for character typos (human or
OCR), punctuation differences, and for longer titles missing or added words.
Roughly, the amount of difference is proportional to the total length of the
strings, so short titles must be a near-exact match, while long titles can have
entire whole words different.

In addition to the title similarity check, only the first author surnames and
year of publication are further checked as heuristics to confirm matches. If I
recall correctly, not even the journal name is is compared. They have done both
performance and correctness evaluation of this process and are happy with the
results, particularly for reference matching.

My (Bryan) commentary on this is that it is probably a good thing to try
implementing for bulk fuzzy candidate generation. I have noticed metadata
issues on Semantic Scholar with similarly titles papers getting grouped, or
fulltext PDF copies getting matched to the wrong paper. I think for fatcat we
should be more conservative, which we can do in our match verification
function.

### Crossref

Crossref has done a fair amount of work on the reference matching problem and
detecting duplicate records. In particular, Dominika Tkaczyk has published a
number of papers and blog posts:

    https://fatcat.wiki/release/search?q=author%3A%22Dominika+Tkaczyk%22
    https://www.crossref.org/authors/dominika-tkaczyk/

Some specific posts:

"Double trouble with DOIs"
<https://www.crossref.org/blog/double-trouble-with-dois/>

"Reference matching: for real this time"
<https://www.crossref.org/blog/reference-matching-for-real-this-time/>

Java implementation and testing/evaluation framework for their reference
matcher:

- <https://gitlab.com/crossref/search_based_reference_matcher>
- <https://gitlab.com/crossref/reference_matching_evaluation_framework>