update README

author: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 18:32:12 +0200
committer: Martin Czygan <martin.czygan@gmail.com> 2020-10-22 18:32:12 +0200
commit: 38b45bc6738b0d53326ee6a62dff15fcb62cfa9c (patch)
tree: 5efaa26b359615b395d620222e51c001e272cc85 /README.md
parent: 9aeacc07be8151a0d44d25cbe377c9f4a09a620a (diff)
download: fuzzycat-38b45bc6738b0d53326ee6a62dff15fcb62cfa9c.tar.gz
fuzzycat-38b45bc6738b0d53326ee6a62dff15fcb62cfa9c.zip
1 files changed, 25 insertions, 9 deletions
diff --git a/README.md b/README.md
index daef5f3..41ef9d4 100644
--- a/README.md
+++ b/README.md
@@ -11,33 +11,43 @@ Scholar](https://scholar.google.com/scholar?q=fuzzy+matching) group
 publications into clusters. Each cluster represents one publication, abstracted
 from its concrete representation as a link to a PDF.
 
-We call the abstract publication *work* and the concrete instance a *release*.
-The goal is to group releases under works and to implement a versions feature.
+We call the abstract publication
+[work](https://guide.fatcat.wiki/entity_work.html) and the concrete instance a
+[release](https://guide.fatcat.wiki/entity_release.html). One goal is to group
+releases under works and to implement a versions feature (self-match). Another
+goal is to support matching external lists (e.g. title lists or
+other documents) against the existing records.
 
 This repository contains both generic code for matching as well as fatcat
 specific code using the fatcat openapi client.
 
 ## Approach
 
-There are probably a few assumption we can make:
+* Local code, with command line entry points for matching as well as an adapter
+  for fatcat.
+
+A few assumptions we need to make:
 
 * If two strings are given, an exact string match does not mean equality (at
   all), e.g.  "Acta geographica" has currently eight associated ISSN, and a
-title like "Buchbesprechungen" appears many hundreds of times.
-* ...
-* ...
+title like "Buchbesprechungen" appears many hundreds of times. We need a bit
+more context for a decision.
 
 ## Datasets
 
-* release and container metadata from: [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05).
+Relevant datasets are:
+
+* release and container metadata from a bulk fatcat export, e.g. [https://archive.org/details/fatcat_bulk_exports_2020-08-05](https://archive.org/details/fatcat_bulk_exports_2020-08-05)
 * issn journal level data, via [issnlister](https://github.com/miku/issnlister)
-* abbreviation lists
+* journal abbreviation lists
 
 ## Matching approaches
 
 ![](static/approach.png)
 
-## Performance data point
+## Performance data points
+
+### Against elasticsearch
 
 Candidate generation via elasticsearch, 40 parallel queries, sustained speed at
 about 17857 queries per hour, that is around 5 queries/s.
@@ -52,6 +62,12 @@ user    29177m5.516s
 sys     4927m3.277s
 ```
 
+### Without a search index
+
+Candidate grouping for self-match can be done locally by extracting a key per
+document, then a group by (via sort and uniq). Clustering 150M docs took about
+607min (around 4k docs/s, no verification step).
+
 ## Data issues
 
 ### A republished article
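
The "Without a search index" section added in this diff describes candidate grouping as extracting a key per document, followed by a group-by (via sort and uniq). A minimal Python sketch of that idea follows. It is an illustration only: `title_key`, `cluster`, and the `ident`/`title` field names are assumptions and are not taken from the fuzzycat code base, and the in-memory sort stands in for the external `sort` a 150M-document run would need.

```python
import itertools
import re


def title_key(doc):
    """Hypothetical key: lowercased title with everything except a-z and 0-9 removed.

    ASCII-only simplification; non-ASCII titles would need a proper normalizer.
    """
    return re.sub(r"[^a-z0-9]", "", doc.get("title", "").lower())


def cluster(docs, key=title_key):
    """Yield (key, [ident, ...]) for every key shared by more than one document."""
    # Step 1: extract one key per document (the "key extraction" pass).
    keyed = sorted((key(doc), doc["ident"]) for doc in docs)
    # Step 2: group adjacent equal keys, the in-memory analogue of sort + uniq.
    for k, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
        idents = [ident for _, ident in group]
        # Skip empty keys and singletons; they carry no match candidates.
        if k and len(idents) > 1:
            yield k, idents


if __name__ == "__main__":
    docs = [
        {"ident": "a", "title": "Fuzzy Matching!"},
        {"ident": "b", "title": "fuzzy  matching"},
        {"ident": "c", "title": "Other"},
    ]
    for k, idents in cluster(docs):
        print(k, idents)
```

As in the README text, this step only produces candidate clusters. Documents that share a key still need a verification step before they are treated as the same work.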