author Martin Czygan <martin.czygan@gmail.com> 2020-10-22 18:32:12 +0200
committer Martin Czygan <martin.czygan@gmail.com> 2020-10-22 18:32:12 +0200
commit 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c
tree 5efaa26b359615b395d620222e51c001e272cc85
parent 9aeacc07be8151a0d44d25cbe377c9f4a09a620a
update README
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 34
1 files changed, 25 insertions, 9 deletions
diff --git a/README.md b/README.md
index daef5f3..41ef9d4 100644
--- a/README.md
+++ b/README.md
@@ -11,33 +11,43 @@ Scholar](https://scholar.google.com/scholar?q=fuzzy+matching) group
publications into clusters. Each cluster represents one publication, abstracted
from its concrete representation as a link to a PDF.
-We call the abstract publication *work* and the concrete instance a *release*.
-The goal is to group releases under works and to implement a versions feature.
+We call the abstract publication
+[work](https://guide.fatcat.wiki/entity_work.html) and the concrete instance a
+[release](https://guide.fatcat.wiki/entity_release.html). One goal is to group
+releases under works and to implement a versions feature (self-match). Another
+goal is to support matching external lists (e.g. title lists or
+other documents) against the existing records.
This repository contains both generic code for matching as well as fatcat
specific code using the fatcat openapi client.
## Approach
-There are probably a few assumption we can make:
+* Local code, with command line entry points for matching as well as an adapter
+ for fatcat.
+
+A few assumptions we need to make:
* If two strings are given, an exact string match does not mean equality (at
all), e.g. "Acta geographica" has currently eight associated ISSN, and a
-title like "Buchbesprechungen" appears many hundreds of times.
-* ...
-* ...
+title like "Buchbesprechungen" appears many hundreds of times. We need a bit
+more context for a decision.
## Datasets
-* release and container metadata from: [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05).
+Relevant datasets are:
+
+* release and container metadata from a bulk fatcat export, e.g. [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05)
* issn journal level data, via [issnlister](https://github.com/miku/issnlister)
-* abbreviation lists
+* journal abbreviation lists
## Matching approaches
![](static/approach.png)
-## Performance data point
+## Performance data points
+
+### Against elasticsearch
Candidate generation via elasticsearch, 40 parallel queries, sustained speed at
about 17857 queries per hour, that is around 5 queries/s.
@@ -52,6 +62,12 @@ user 29177m5.516s
sys 4927m3.277s
```
+### Without a search index
+
+Candidate grouping for self-match can be done locally by extracting a key per
+document, then grouping by that key (via sort and uniq). Clustering 150M docs
+took about 607 min (around 4k docs/s, no verification step).
+
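The key-then-group idea described above can be sketched in a few lines. This is a minimal illustration, not fuzzycat's actual code: the names `title_key` and `cluster_by_key`, the normalized-title key, and the `{"k": ..., "v": ...}` output shape are all assumptions made for the example. Sorting by key and grouping adjacent records mirrors what sort and uniq do on the command line.

```python
import itertools
import re


def title_key(doc):
    """Hypothetical key: lowercase the title and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster_by_key(docs, key=title_key):
    """Sort documents by key, then group neighbours that share a key."""
    keyed = sorted(((key(doc), doc) for doc in docs), key=lambda pair: pair[0])
    clusters = []
    for k, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        members = [doc for _, doc in group]
        # Skip empty keys and singletons: they offer no candidate pair.
        if k and len(members) > 1:
            clusters.append({"k": k, "v": members})
    return clusters


# Made-up sample records; "ident" and "title" are assumed field names.
docs = [
    {"ident": "a", "title": "Fuzzy Matching!"},
    {"ident": "b", "title": "fuzzy  matching"},
    {"ident": "c", "title": "Buchbesprechungen"},
]
clusters = cluster_by_key(docs)
```

Because an identical key does not imply an identical publication, each cluster is only a set of candidates and would still need a verification step.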
## Data issues
### A republished article