| author | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 18:32:12 +0200 | 
|---|---|---|
| committer | Martin Czygan <martin.czygan@gmail.com> | 2020-10-22 18:32:12 +0200 | 
| commit | 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c | |
| tree | 5efaa26b359615b395d620222e51c001e272cc85 | |
| parent | 9aeacc07be8151a0d44d25cbe377c9f4a09a620a | |
update README
| -rw-r--r-- | README.md | 34 | 
1 files changed, 25 insertions, 9 deletions
````diff
@@ -11,33 +11,43 @@ Scholar](https://scholar.google.com/scholar?q=fuzzy+matching) group
 publications into clusters. Each cluster represents one publication, abstracted
 from its concrete representation as a link to a PDF.
 
-We call the abstract publication *work* and the concrete instance a *release*.
-The goal is to group releases under works and to implement a versions feature.
+We call the abstract publication
+[work](https://guide.fatcat.wiki/entity_work.html) and the concrete instance a
+[release](https://guide.fatcat.wiki/entity_release.html). One goal is to group
+releases under works and to implement a versions feature (self-match). Another
+goal is to have support for matching of external lists (e.g. title lists or
+other document) to the existing records.
 
 This repository contains both generic code for matching as well as fatcat
 specific code using the fatcat openapi client.
 
 ## Approach
 
-There are probably a few assumption we can make:
+* Local code, with command line entry points for matching as well as adapter
+  for fatcat.
+
+A few assumption we need to make:
 
 * If two strings are given, an exact string match does not mean equality (at
   all), e.g.  "Acta geographica" has currently eight associated ISSN, and a
-title like "Buchbesprechungen" appears many hundreds of times.
-* ...
-* ...
+title like "Buchbesprechungen" appears many hundreds of times. We need a bit
+more context for a decision.
 
 ## Datasets
 
-* release and container metadata from: [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05).
+Relevant datasets are:
+
+* release and container metadata from a bulk fatcat export, e.g. [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05)
 * issn journal level data, via [issnlister](https://github.com/miku/issnlister)
-* abbreviation lists
+* journal abbreviation lists
 
 ## Matching approaches
 
-## Performance data point
+## Performance data points
+
+### Against elasticsearch
 
 Candidate generation via elasticsearch, 40 parallel queries, sustained speed at
 about 17857 queries per hour, that is around 5 queries/s.
@@ -52,6 +62,12 @@ user    29177m5.516s
 sys     4927m3.277s
 ```
 
+### Without a search index
+
+Candidate grouping for self-match can be done locally by extracting a key per
+document, then a group by (via sort and uniq). Clustering 150M docs took about
+607min (around 4k docs/s, no verification step).
+
 ## Data issues
 
 ### A republished article
````
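The "Without a search index" section added in this commit describes candidate grouping as "extract a key per document, then group by". A minimal sketch of that idea follows. It is not the fuzzycat implementation: the key function `title_key`, the `ident` and `title` field names, and the output shape are assumptions made for illustration only.

```python
import itertools
import re


def title_key(doc):
    """Hypothetical key: lowercased title, keeping ASCII letters and digits only."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster(docs, key=title_key):
    """Group document identifiers by key, in the spirit of `sort | uniq`."""
    # Sort by key only; the sort is stable, so input order is kept within a group.
    keyed = sorted(((key(d), d["ident"]) for d in docs), key=lambda kv: kv[0])
    for k, group in itertools.groupby(keyed, key=lambda kv: kv[0]):
        yield {"k": k, "v": [ident for _, ident in group]}


# Made-up sample records, not taken from any fatcat export.
docs = [
    {"ident": "a1", "title": "Acta geographica"},
    {"ident": "b2", "title": "ACTA  Geographica."},
    {"ident": "c3", "title": "Buchbesprechungen"},
]
clusters = list(cluster(docs))
```

This sketch sorts in memory, which would not scale to the 150M documents mentioned in the README; at that size the sort would more plausibly be external. As the README notes, sharing a key only makes two records candidates, so a verification step would still be needed before treating them as the same work.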
