author | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 18:32:12 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 18:32:12 +0200 |
commit | 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c (patch) | |
tree | 5efaa26b359615b395d620222e51c001e272cc85 | |
parent | 9aeacc07be8151a0d44d25cbe377c9f4a09a620a (diff) | |
download | fuzzycat-38b45bc6738b0d53326ee6a62dff15fcb62cfa9c.tar.gz fuzzycat-38b45bc6738b0d53326ee6a62dff15fcb62cfa9c.zip |
update README
-rw-r--r-- | README.md | 34 |
1 files changed, 25 insertions, 9 deletions
@@ -11,33 +11,43 @@ Scholar](https://scholar.google.com/scholar?q=fuzzy+matching) group
 publications into clusters. Each cluster represents one publication, abstracted
 from its concrete representation as a link to a PDF.

-We call the abstract publication *work* and the concrete instance a *release*.
-The goal is to group releases under works and to implement a versions feature.
+We call the abstract publication
+[work](https://guide.fatcat.wiki/entity_work.html) and the concrete instance a
+[release](https://guide.fatcat.wiki/entity_release.html). One goal is to group
+releases under works and to implement a versions feature (self-match). Another
+goal is to have support for matching of external lists (e.g. title lists or
+other document) to the existing records.

 This repository contains both generic code for matching as well as fatcat
 specific code using the fatcat openapi client.

 ## Approach

-There are probably a few assumption we can make:
+* Local code, with command line entry points for matching as well as adapter
+ for fatcat.
+
+A few assumption we need to make:

 * If two strings are given, an exact string match does not mean equality (at
 all), e.g. "Acta geographica" has currently eight associated ISSN, and a
-title like "Buchbesprechungen" appears many hundreds of times.
-* ...
-* ...
+title like "Buchbesprechungen" appears many hundreds of times. We need a bit
+more context for a decision.

 ## Datasets

-* release and container metadata from: [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05).
+Relevant datasets are:
+
+* release and container metadata from a bulk fatcat export, e.g. [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05)
 * issn journal level data, via [issnlister](https://github.com/miku/issnlister)
-* abbreviation lists
+* journal abbreviation lists

 ## Matching approaches

 ![](static/approach.png)

-## Performance data point
+## Performance data points
+
+### Against elasticsearch

 Candidate generation via elasticsearch, 40 parallel queries, sustained speed at
 about 17857 queries per hour, that is around 5 queries/s.
@@ -52,6 +62,12 @@ user 29177m5.516s
 sys 4927m3.277s
 ```

+### Without a search index
+
+Candidate grouping for self-match can be done locally by extracting a key per
+document, then a group by (via sort and uniq). Clustering 150M docs took about
+607min (around 4k docs/s, no verification step).
+
 ## Data issues

 ### A republished article
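
The "Without a search index" paragraph added in this commit describes candidate grouping as extracting a key per document, followed by a group by. A minimal Python sketch of that idea follows; it is not the fuzzycat implementation. The `title_key` function, the `cluster_by_key` helper, the `ident` and `title` field names, and the `{"k": ..., "v": ...}` output shape are all assumptions made for illustration, and an in-memory `sorted` stands in for the external `sort` and `uniq` step that 150M documents would need.

```python
import itertools
import re


def title_key(doc):
    """Hypothetical key: lowercased title with everything but a-z and 0-9 removed.

    This is a crude normalization (it also drops non-ASCII letters) chosen
    only to keep the example short.
    """
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster_by_key(docs, key=title_key):
    """Group document identifiers that share a key (sort, then group by)."""
    # Sort (key, ident) pairs so equal keys become adjacent, as `sort` would do.
    keyed = sorted((key(d), d["ident"]) for d in docs)
    clusters = []
    for k, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        idents = [ident for _, ident in group]
        # Keep only non-empty keys shared by at least two documents.
        if k and len(idents) > 1:
            clusters.append({"k": k, "v": idents})
    return clusters


# Made-up sample records; real input would be lines from a bulk export.
docs = [
    {"ident": "a1", "title": "Fuzzy Matching of Citations"},
    {"ident": "b2", "title": "fuzzy matching of citations."},
    {"ident": "c3", "title": "Buchbesprechungen"},
]
clusters = cluster_by_key(docs)
```

The resulting clusters are only candidates. As the commit text notes, no verification step is included, and the README's own "Acta geographica" example shows why an identical key does not imply the same work.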