From 4ffc55fe5a408a4a62cdcc680ff9c6fc93b3f5f5 Mon Sep 17 00:00:00 2001
From: bnewbold
Date: Wed, 1 Apr 2020 21:52:08 +0000
Subject: proposal: fuzzy matching

---
 proposals/2020_fuzzy_matching.md | 430 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 430 insertions(+)
 create mode 100644 proposals/2020_fuzzy_matching.md

diff --git a/proposals/2020_fuzzy_matching.md b/proposals/2020_fuzzy_matching.md
new file mode 100644
index 00000000..30c321e3
--- /dev/null
+++ b/proposals/2020_fuzzy_matching.md
@@ -0,0 +1,430 @@

Status: planned

Bibliographic Entity Fuzzy Match and Verification
=================================================

This document summarizes work so far on entity "fuzzy matching" and match
verification, notes some specific upcoming use cases for such methods, and
proposes new generic routines to be implemented.

There are three main routines desired:

**Match verification:** given two bibliographic metadata records (or two entire
entities), confirm whether they refer to the exact same version of the entity.
Optionally, also determine whether they are not the exact same entity, but are
variations of the same entity (eg, releases that should be grouped under a
shared work entity). Performance is not as critical as correctness. Should be a
pure, in-memory routine (it should not touch the network or external services).

**Fuzzy matching:** match bibliographic metadata to an entity in the
current/live fatcat catalog. Eg, given complete or partial metadata about a
paper or other entity, return a list of candidate matches. Emphasis on recall
over precision; better to return false matches than to miss records. This would
likely hit the fatcat elasticsearch indexes (search.fatcat.wiki), which are
continuously updated to be in sync with the catalog itself (API). Expected to
scale to, eg, 5-10 lookups per second and operate over up to a couple million
entities during cleanup or merging operations. Should operate per record,
on-demand, always up-to-date (aka, not a batch process).

**Bulk fuzzy match:** a tool capable of matching hundreds of millions or
billions of raw metadata records against the entire fatcat catalog, resulting
in candidate fuzzy match pairs or groupings. The main applications are
de-duplicating or grouping releases over the entire catalog (aka, matching the
catalog against itself), matching billions of citations (structured partial
metadata) against release entities, and grouping citations which fail to match
to existing entities against each other. Likely to include a batch mode, though
it may also include an efficient per-record lookup API.

As a terminology note, the outputs of a fuzzy match are called "candidate
matches", and the output of match verification would be a "confirmed match" or
"confident match". A "self-match" is when an exact, complete entity is compared
to a set it already exists in, and matches to itself (aka, the fatcat entity
identifier is an exact match). "Identifier matches" are an easy case where
external identifiers (like DOI, ISSN, ORCID) are used for matching; note that
the fatcat lookup API always returns only the first such match, though there
may actually be multiple entities with the same external identifier, so this
may not always be sufficient.
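
As an illustration of the second routine (on-demand fuzzy matching), here is a
minimal sketch against the release search index, assuming the elasticsearch-py
client; the index name (`fatcat_release`) and field names (`title`,
`contrib_names`, `release_year`) are assumptions about the search schema:

    import elasticsearch

    def match_release_fuzzy(es, title, first_author=None, release_year=None):
        """Fetch candidate releases for complete or partial metadata.
        Emphasizes recall, so candidates must still go through match
        verification."""
        query = {"bool": {
            # fuzzy title match tolerates typos and small variations
            "must": [{"match": {"title": {"query": title,
                                          "fuzziness": "AUTO"}}}],
            "should": [],
        }}
        if first_author:
            query["bool"]["should"].append(
                {"match": {"contrib_names": first_author}})
        if release_year:
            # allow +/- 1 year to cover preprint/publication date skew
            query["bool"]["should"].append(
                {"range": {"release_year": {"gte": release_year - 1,
                                            "lte": release_year + 1}}})
        resp = es.search(index="fatcat_release",
                         body={"query": query, "size": 10})
        return [hit["_source"] for hit in resp["hits"]["hits"]]

Candidates returned this way would then be run through the match verification
routine to produce confirmed matches.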


## Container Matching

Containers are a simpler case, so will discuss those first. Function signatures
would look like:

    verify_container_match(left:container, right:container) -> status:str
    match_container_fuzzy(record:??) -> [candidate:container]

Verify statuses could be:

- `exact`: all metadata fields match byte-for-byte
- `strong`: good confidence these are the same
- `weak`: could be a match, perhaps missing enough metadata to be confident
- `ambiguous`: not enough confidence that it is or even isn't a match
- `not-match`: confident that these are *not* the same container (TODO: better word?)

Alternatively, could return a floating point confidence number, but that puts
the burden of interpretation on external code.

Fields of interest would be at least:

- name
- original name
- aliases
- publisher
- abbrev

Some test cases that match verification should probably handle (just making
these up):

- differences in title whitespace and capitalization
- title of one record has ISSN appended ("PLOS One" matches "PLOS One
  1932-6203")
- title of one record has publisher appended ("PLOS One" matches "PLOS One
  (Public Library of Science)")
- record with original (non-English) name in the name field matches a fatcat
  record where the non-English title is in the `original_title` field
- difference of "The" doesn't matter ("Lancet" matches "The Lancet")
- detect and reject bogus names

Nice to haves:

- records with only an abbreviation or acronym match (for "official"
  abbreviations and acronyms, which may need to be recorded as aliases)
- records with the full name match when the "official" title is now an acronym
  (eg, "British Medical Journal" matches "The BMJ")
- optionally, detect and return `ambiguous` for a fixed list of known-ambiguous
  names (eg, two journals with very similar titles, where name or acronym alone
  can't distinguish them)

Can't remember if we already have aliases stored in container entity `extra`
yet, but the idea is to store "also known as" names to help with fuzzy
matching.

The main application of these routines would be adding container linkage to
releases where we have a container name but no identifier (eg, an ISSN). We
have millions of releases (many with DOIs) that have a container name but no
linked container. We also want to import millions of papers based on GROBID
metadata alone, where we will likely only extract a journal name, and will want
to do lookups on that.

For the particular case of containers, we'll probably want to memoize results
(aka, cache lookups) for efficiency. Probably will also want a unified helper
function like:

    match_container(name:str, **kwargs) -> (status:str, match:Option)

which would handle caching (maybe this is a class/object not a function?), call
`match_container_fuzzy()` to get candidates, call `verify_container_match()` on
each candidate, and ensure there is only one `exact` or `strong` match. Eg, if
there are multiple `weak` matches, should return `ambiguous` for the group
overall.
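
A minimal sketch of this unified helper, assuming the proposed
`match_container_fuzzy()` and `verify_container_match()` routines exist, and
using simple memoization; the status handling and cache size are illustrative:

    import functools

    STRONG_STATUSES = ("exact", "strong")

    @functools.lru_cache(maxsize=50000)
    def match_container(name):
        """Returns (status, container); container is None unless the status
        is 'exact' or 'strong'."""
        candidates = match_container_fuzzy(name)
        if not candidates:
            return ("not-match", None)
        verified = [(verify_container_match({"name": name}, c), c)
                    for c in candidates]
        confident = [(s, c) for (s, c) in verified if s in STRONG_STATUSES]
        if len(confident) == 1:
            return confident[0]
        if len(confident) > 1:
            # multiple strong candidates: can't safely pick just one
            return ("ambiguous", None)
        if any(s == "weak" for (s, _) in verified):
            return ("ambiguous", None)
        return ("not-match", None)

If the `**kwargs` form is kept, a small class with an explicit cache dict may
be easier to reason about than `functools.lru_cache`.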

Another speculative application of these routines could be as part of chocula,
to try and find "true" ISSNs for records where we know the ISSN is not
registered or is invalid (eg, the ISSN checksum digit fails). Another would be
to look for duplicate container/ISSNs, where there are multiple ISSNs for the
same journal but only one has any papers actually published; in that case we
would maybe want to mark the other container entity as a "stub".

Test datasets:

- in kibana/ES, filter for "`container_id` does not exist" and
  "`container_name` exists"
- in sandcrawler, dump the `grobid` table and look in the metadata snippet,
  which includes extracted journal info


## Release Matching (verification and fuzzy, not bulk fuzzy)

For release entities, the signature probably looks something like:

    verify_release_match(left:release, right:release) -> status:str
        how confident, based on metadata, that these are the same release, or
        should be grouped under the same work?
    exact_release_match(left:release, right:release) -> bool
    exact_work_match(left:release, right:release) -> bool

Instead of full release entities, we could use partial metadata in the form of
a python dataclass or named tuple. If we use a struct, we will need a routine
to convert full entities to the partial struct; if we use full entities, we
will probably want a helper to construct entities from partial metadata.

Don't know which fields would be needed, but guessing at least the following,
in priority order:

- if full entity: `ident` (and `work_id` for the case of releases)
- `title`
- `authors` (at least a list of surnames, possibly also raw and given names)
- `ext_ids`
- `release_year` (and full `release_date` if available)
- volume, issue, pages
- `subtitle`
- `original_title`
- container name (and `container_id` if available)
- `release_type` and `release_stage` (particularly for work/release grouping
  vs. match)

For releases, the "status" output could be:

- `strong` (good confidence)
- `weak` (seems good, but maybe not enough metadata to be sure)
- `work-strong` (good confidence that same work, but not same release)
- `work-weak`
- `ambiguous`
- `not-match` (name?)

Cases we should handle (a sketch follows the lists below):

- ability to filter out self-matches from the fuzzy matcher. Eg, if we want to
  find a duplicate of an existing release, should be able to filter out the
  self-match (as a kwarg?)
- ability to filter out entire categories of matches. Eg, match only to
  published works (`release_stage`), only to papers (`release_type`), or don't
  match to longtail OA papers (`is_longtail_oa` ES flag)
- have an in-memory stoplist for ambiguous titles (eg, "abstract", "index")
- differences in punctuation, capitalization
- single-character typos
- subtitle/title matching (eg: {title: "Important Research", subtitle: "A
  History"} should match {title: "Important Research: A History", subtitle:
  None})
- a record with only one author, surname only, should be able to match against
  a full author list. Maybe a flag to indicate this case? Eg, for matching some
  citation styles where only one author is listed
- years are +/- 1. Certainly for pre-prints ("work matches"), but also
  different databases sometimes have different semantics about whether
  "publication date", "submission date", or "indexed date" is used as the
  `release_date` (which we consider the "publication date"). Maybe a "weak"
  match in this case
- (probably a lot more I'm not thinking of)

Nice to have (but harder?):

- journal/container matching where one side has an abbreviation, acronym, or
  alias of the other. Does this require a network fetch? Maybe cached or
  pre-loaded? May not be important; do testing first. Note that the entire
  container entity is transcluded when doing full release entity API lookups.
  This may be important for citation matching/verification.
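
A sketch of `verify_release_match()` covering a few of the cases above
(punctuation and capitalization differences, subtitle/title merging, +/- 1
year); the stoplist contents and status decisions are illustrative:

    import re

    TITLE_STOPLIST = {"abstract", "index", "preface", "editorial"}

    def norm_title(title, subtitle=None):
        """Lowercase, merge any subtitle, strip non-alphanumeric characters."""
        if subtitle:
            title = "{} {}".format(title, subtitle)
        return re.sub(r"[^a-z0-9]", "", title.lower())

    def verify_release_match(left, right):
        lt = norm_title(left.get("title") or "", left.get("subtitle"))
        rt = norm_title(right.get("title") or "", right.get("subtitle"))
        if not lt or not rt or lt in TITLE_STOPLIST or rt in TITLE_STOPLIST:
            return "ambiguous"
        if lt != rt:
            return "not-match"
        left_year = left.get("release_year")
        right_year = right.get("release_year")
        if left_year and right_year:
            if left_year == right_year:
                return "strong"
            if abs(left_year - right_year) <= 1:
                # could be a preprint vs. published version, or differing
                # date semantics between source databases
                return "weak"
            return "not-match"
        return "weak"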

Will probably want a helper function to check that a metadata record (or
entity) has enough metadata and seems in-scope for matching. For example, a
record with a title of just "Abstract" should probably be blocked from any
match attempt, because there are so many records with that metadata. On the
other hand, if there is an external identifier (eg, DOI), we could still
attempt a direct match. Maybe something like:

    can_verify_match(release:release) -> boolean

Test datasets:

- PDFs crawled as part of longtail OA crawls (see context below). We have both
  the fatcat releases already imported, and new GROBID-extracted works. Can use
  glutton fuzzy matching for comparison, or verify those glutton matches
- many, many unmatched reference strings; will handle these separately

Optionally, we could implement a glutton-compatible API with equivalent or
better performance for doing GROBID "header consolidation". If performance is
great, we could even use it for reference consolidation at crawl/ingest time.


## Bulk Fuzzy

The current concept for this is to implement Semantic Scholar's algorithm
(described below) in any of python, golang, or rust, to cover all of these
use cases:

- grouping releases in fatcat which are variants of the same work (or, eg,
  where a publisher registered multiple DOIs for the same paper by accident)
  into the same work
- reference matching from structured partial metadata to fatcat releases
- grouping of unmatched references (from reference matching) as structured
  partial metadata, with the goal of finding, eg, the "100,000 most cited
  papers which do not have a fatcat entity", and creating fatcat entities for
  them

Optionally, we could also architect/design this tool to replace biblio-glutton
for ingest-time "reference consolidation", by exposing a biblio-glutton
compatible API. If this isn't possible, or is hard, it could become a later
tool instead. Eg, we shouldn't sacrifice batch performance for this. In
particular, for ingest-time reference matching we'd want the backing corpus to
be updated continuously, which might be tricky or in conflict with a batch-mode
design.

## Existing Fatcat Work

### Hadoop matching pipeline

In Summer 2018, the fatcat project had a wonderful volunteer, Ellen Spertus.
Ellen implemented a batch fuzzy matching pipeline in Scala (using the Scalding
map/reduce framework) that ran on our Hadoop cluster. We used it to "join"
GROBID metadata from PDF files against a Crossref metadata dump.

The Scala job worked by converting input metadata records into simple "bibjson"
metadata subsets (plus keeping the original record identifiers, eg Crossref DOI
or PDF file hash, for later import). It created a key for each record by
normalizing the title (removing all whitespace and non-alphanumeric characters,
lower-casing, etc; we called this a "slug"), and filtering out keys from a
blocklist. We used the map/reduce framework to join the two tables on these
keys, filtered the output pairs with a small bit of additional title similarity
comparison logic, then dumped the list of pairs as a table to Hadoop, sorted by
join key (slug). I can't remember if we had any other heuristics in the Scala
code.
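
A sketch of the "slug" normalization described above (lowercase, strip
whitespace and all non-alphanumeric characters, filter against a blocklist);
the blocklist contents here are illustrative:

    import re

    SLUG_BLOCKLIST = {"abstracts", "bookreviews", "editorial", "introduction"}

    def make_slug(title):
        """Returns a join key for a title, or an empty string if the key is
        blocked as too generic to be useful."""
        slug = re.sub(r"[^a-z0-9]", "", title.lower())
        if not slug or slug in SLUG_BLOCKLIST:
            return ""
        return slug

    assert make_slug("The  Lancet!") == "thelancet"
    assert make_slug("Book Reviews") == ""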

This was followed by a second processing stage in python, which iterated over
the full list. It grouped pairs by key, and discarded any groups that had too
many pairs (on the assumption that those titles were too generic). It then ran
additional quality checks (much easier/faster to implement in python) on year
and author names (eg, checking that the number of authors matched). The output
of this filtering was then fed into a fatcat importer which matched the PDFs to
releases based on DOI.

We got several million matches using this technique. In the end we really only
ran this pipeline once. Hadoop (and HBase in particular) ended up being
frustrating to work with, as jobs took hours or days to run, and many bugs
would only appear when run against the full dataset. A particular problem that
came up with the join approach was N^2 explosions for generic titles, where the
number of join rows would get very large (millions of rows) for generic titles,
even after we filtered out the couple hundred most popular join keys.

TODO: should update this section with the specific algorithms and parameters
used, by reading the Scala and Python source

### Longtail OA Import Filtering

Not directly related to matching, but to filtering mixed-quality metadata.

As part of longtail OA preservation work, we ran a crawl of small OA journal
websites, and then ran GROBID over the resulting PDFs to extract metadata. We
then filtered the output metadata using quality heuristics, and inserted both
new releases (with no external identifiers, just the extracted metadata) and
the associated files.

The metadata filtering pipeline is interesting because of all the bad metadata
it detected: eg, overly long titles, poor author metadata, etc. It used the
normalization and blocklist from the Hadoop matching work.

A big problem that was only noticed after this import was that many of the
papers imported were actually duplicates of existing fatcat papers with DOIs or
other identifiers. This was because the crawl ended up spidering some general
purpose repositories, which contained copies of large-publisher OA papers (eg,
PLOS papers). The solution to this will use fuzzy matching in two ways. First,
for future imports of this type, we will fuzzy match against the catalog to
check that there isn't already a metadata record; possibly link the file to
matched existing entities (based on confidence), but certainly don't create new
records unless we are sure that there isn't an existing one (`not-match`). We
also need to ensure there are no duplicates *within* each import batch, by
either running the import slowly in a single thread (so elasticsearch and the
matching system have time to synchronize), or doing a batch fuzzy match first.
Or possibly some other pre-filtering idea. Secondly, we can go back over these
longtail OA works (which are tagged as such in the fatcat catalog) and attempt
to match them against non-longtail fatcat releases (eg, those with an existing
PMID or DOI), and merge the releases together (note: redirecting the longtail
release to the full one, resulting in a single release, not just doing work
grouping of releases).

A separate problem from this import is that none of the papers have container
linkage (though many or all have container names successfully extracted). After
doing release-level merging, we should use container fuzzy matching to update
container linkage for these entities. We could potentially do this as a
two-stage project where we dump all the container name strings, prioritize them
by release count, and iterate on container matching until we are happy with the
results, then run the actual release updates.
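
A sketch of the first stage of that project, assuming a TSV dump of
`(release_count, container_name)` rows and the `match_container()` helper
proposed earlier; the dump format and report layout are assumptions:

    import csv
    import sys

    def container_match_report(dump_path):
        """Prints a match status line for each dumped container name,
        most common names first."""
        with open(dump_path) as f:
            rows = sorted(csv.reader(f, delimiter="\t"),
                          key=lambda row: -int(row[0]))
        for count, name in rows:
            status, container = match_container(name)
            print("{}\t{}\t{}".format(status, count, name))

    if __name__ == "__main__":
        container_match_report(sys.argv[1])

Iterating on this kind of report (tweak the matcher, re-run, eyeball the
output) is cheap compared to running actual release updates.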

### biblio-glutton

[biblio-glutton][biblio-glutton] is a companion tool for GROBID that does
record matching and metadata enrichment of both "header" metadata (aka,
extracted metadata about the fulltext paper in the PDF itself) and references
(extracted from the reference/bibliography section of the fulltext paper).
GROBID calls this "consolidation".

biblio-glutton supports a couple of different metadata index sources, including
Crossref dumps. IA has patched both GROBID and biblio-glutton to work with
fatcat metadata directly, and to embed fatcat release identifiers in the output
TEI-XML when there is a match. We have found performance to be fine for header
consolidation, but too slow for reference consolidation. This is because there
is only one glutton lookup per PDF in header mode, vs. 20-50 lookups per PDF
for reference consolidation, which takes longer than the overall PDF
extraction. In our default configuration, we only do header consolidation.

biblio-glutton runs as a REST API server, which can be queried separately from
the GROBID integration. It runs on the JVM (can't remember if Java or Scala),
and works by doing an elasticsearch query to find candidates, then selecting
the best candidate (if any) and looking up full metadata records from one or
more LMDB key/value databases. It requires its own elasticsearch index with a
custom schema, and large LMDB files on disk (which should be on SSD for speed).
The process of updating the LMDB files and the elasticsearch index is currently
manual (with some scripts), generated from bulk fatcat metadata dumps. This
means the glutton results get out of sync with the fatcat catalog.

The current update process involves stopping glutton (which means stopping all
GROBID processing) for a couple of hours. Compare this to the fatcat search
index (search.fatcat.wiki), which is continuously updated from the changelog
feed, and even during index schema changes has zero (or near zero) downtime.

[biblio-glutton]: https://github.com/kermitt2/biblio-glutton

## Existing External Work and Reading

"The Lens MetaRecord and LensID: An open identifier system for aggregated
metadata and versioning of knowledge artefacts"

### Semantic Scholar

Semantic Scholar has described their technique for doing bulk reference
matching and entity merging, summarized here:

Fast candidate lookups are done using only a subset of the title. Title strings
are turned into an array of normalized tokens (words), with stopwords (like
"a", "the") removed. Two separate indices are created: one with the first three
tokens as the key, and the other with the last three tokens as the key. For
each key, there will be a bucket of many papers (or just paper global
identifiers). A per-paper lookup by title fetches candidates from both indices.
It is also possible to iterate over both indices by bucket, doing further
processing between all the papers in a bucket, then combining the
matches/groups from both iterations. The reason for using two indices is to be
robust against mangled metadata where there is added junk or missing words at
either the beginning or end of the title.

To verify candidate pairs, the Jaccard similarity is calculated between the
full original title strings. This flexibly allows for character typos (human or
OCR), punctuation differences, and, for longer titles, missing or added words.
Roughly, the amount of allowed difference is proportional to the total length
of the strings, so short titles must be a near-exact match, while long titles
can have entire words differ.

In addition to the title similarity check, only the first author's surname and
the year of publication are further checked as heuristics to confirm matches.
If I recall correctly, not even the journal name is compared. They have done
both performance and correctness evaluation of this process and are happy with
the results, particularly for reference matching.
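
A sketch of this candidate generation and verification scheme; the
tokenization, stopword list, Jaccard threshold, and the choice of token-level
(rather than character n-gram) Jaccard are all illustrative assumptions:

    import re
    from collections import defaultdict

    STOPWORDS = {"a", "an", "and", "in", "of", "on", "the"}

    def title_tokens(title):
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        return [t for t in tokens if t not in STOPWORDS]

    def build_indices(papers):
        """papers maps identifier -> title; returns (first-three-token,
        last-three-token) indices, each mapping a key to a set of ids."""
        head, tail = defaultdict(set), defaultdict(set)
        for ident, title in papers.items():
            tokens = title_tokens(title)
            if len(tokens) >= 3:
                head[tuple(tokens[:3])].add(ident)
                tail[tuple(tokens[-3:])].add(ident)
        return head, tail

    def jaccard(a, b):
        sa, sb = set(title_tokens(a)), set(title_tokens(b))
        return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

    def lookup(title, papers, head, tail):
        """Fetch candidates from both indices, then verify by similarity."""
        tokens = title_tokens(title)
        if len(tokens) < 3:
            return set()
        candidates = (head.get(tuple(tokens[:3]), set())
                      | tail.get(tuple(tokens[-3:]), set()))
        return {c for c in candidates if jaccard(title, papers[c]) >= 0.8}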

My (Bryan) commentary on this is that it is probably a good approach to try
implementing for bulk fuzzy candidate generation. I have noticed metadata
issues on Semantic Scholar with similarly titled papers getting grouped, or
fulltext PDF copies getting matched to the wrong paper. I think for fatcat we
should be more conservative, which we can do in our match verification
function.

### Crossref

Crossref has done a fair amount of work on the reference matching problem and
on detecting duplicate records. In particular, Dominika Tkaczyk has published a
number of papers and blog posts:

    https://fatcat.wiki/release/search?q=author%3A%22Dominika+Tkaczyk%22
    https://www.crossref.org/authors/dominika-tkaczyk/

Some specific posts:

"Double trouble with DOIs"

"Reference matching: for real this time"

Java implementation and testing/evaluation framework for their reference
matcher:

-
-
-- cgit v1.2.3