update README

author: Martin Czygan <martin.czygan@gmail.com> 2020-12-23 02:56:25 +0100
committer: Martin Czygan <martin.czygan@gmail.com> 2020-12-23 02:56:25 +0100
commit: 5dbdecb26dcce7411581af7d7463d7eb4b02e64a (patch)
tree: 2a551ffd0228a8abc21271b51b0d8b2ccb72366d
parent: 39f379f5b8da3f1912d92477feb3b38d1bb891d7 (diff)
download: fuzzycat-5dbdecb26dcce7411581af7d7463d7eb4b02e64a.tar.gz fuzzycat-5dbdecb26dcce7411581af7d7463d7eb4b02e64a.zip
1 files changed, 22 insertions, 10 deletions
diff --git a/README.md b/README.md
index 59bca2c..e3438ae 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@ Fuzzy matching utilities for [fatcat](https://fatcat.wiki).
 
 ![https://pypi.org/project/fuzzycat/](https://img.shields.io/pypi/v/fuzzycat?style=flat-square)
 
-To install with [pip](https://pypi.org/project/pip/):
+To install with [pip](https://pypi.org/project/pip/), run:
 
 ```
 $ pip install fuzzycat
@@ -23,6 +23,19 @@ For example we can identify:
 * preprint and published pairs
 * similar items from different sources
 
+## TODO
+
+* [ ] take a list of title strings and return match candidates (faster than
+  elasticsearch); e.g. derive a key and find similar keys in some cached clusters
+* [ ] take a list of title, author documents and return match candidates; e.g.
+  key may depend on title only, but verification can be more precise
+* [ ] take a more complete, yet partial document and return match candidates
+
+For this to work, we will need to have clusters from fatcat precomputed and
+cached. We also might want to have them sorted by key (which is a side effect of
+clustering) so we can binary search into the cluster file for the above todo
+items.
+
 ## Dataset
 
 For development, we worked on a `release_export_expanded.json` dump (113G/700G
@@ -30,9 +43,7 @@ zstd/plain, XXX lines) and with the [fatcat API](https://api.fatcat.wiki/).
 
 ![](notes/steps.png)
 
-## Facilities
-
-### Clustering
+## Clustering
 
 Clustering derives sets of similar documents from a [fatcat database release
 dump](https://archive.org/details/fatcat_snapshots_and_exports?&sort=-publicdate).
@@ -56,7 +67,7 @@ Clustering works in a three step process:
 2. sorting by keys (via [GNU sort](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html))
 3. group by key and write out ([itertools.groupby](https://docs.python.org/3/library/itertools.html#itertools.groupby))
 
-### Verification
+## Verification
 
 Run verification (pairwise *double-check* of match candidates in a cluster).
 
@@ -136,9 +147,7 @@ user    2605m41.347s
 sys     118m38.141s
 ```
 
-So, 29881072 (about 20%) docs in the potentially duplicated set.
-
-Verification (about 15h w/o parallel):
+So, 29881072 (about 20%) docs in the potentially duplicated set. Verification (about 15h w/o parallel):
 
 ```
 $ time zstdcat -T0 cluster_tsandcrawler.json.zst | python -m fuzzycat verify | \
@@ -151,8 +160,11 @@ user    939m32.761s
 sys     36m47.602s
 ```
 
+----
+
+# Misc
 
-# Use cases
+## Use cases
 
 * [ ] take a release entity database dump as JSON lines and cluster releases
   (according to various algorithms)
@@ -162,7 +174,7 @@ sys     36m47.602s
   strings to release titles (this needs some transparent setup, e.g. filling of
 a cache before ops)
 
-# Usage
+## Usage
 
 Release clusters start with release entities json lines.
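
Note on the TODO items in this patch: the README describes clustering as key derivation, sorting by key, and grouping by key, and the TODO proposes a binary search into the key-sorted clusters. The following is a minimal in-memory sketch of that idea, not the fuzzycat implementation. The function names (`title_key`, `cluster`, `lookup`), the key function, and the `ident`/`title` field names are assumptions for illustration; the real pipeline sorts on disk with GNU sort and would search a cluster file rather than a Python list.

```python
import bisect
import itertools
import re


def title_key(title):
    """Hypothetical key function: lowercase, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", title.lower())


def cluster(docs):
    """Derive a key per document, sort by key, then group by key.

    Returns a list of (key, [ident, ...]) tuples, sorted by key.
    Assumes each doc is a dict with "ident" and "title" fields.
    """
    keyed = sorted((title_key(d["title"]), d["ident"]) for d in docs)
    return [
        (key, [ident for _, ident in group])
        for key, group in itertools.groupby(keyed, key=lambda kv: kv[0])
    ]


def lookup(clusters, keys, title):
    """Binary search the sorted key list for the key derived from `title`.

    `keys` must be the key column of `clusters`, in the same order.
    Returns the list of idents in the matching cluster, or an empty list.
    """
    key = title_key(title)
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return clusters[i][1]
    return []
```

With `clusters = cluster(docs)` and `keys = [k for k, _ in clusters]` computed once and cached, each title lookup costs a logarithmic number of comparisons. Candidates found this way would still need the pairwise verification step.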