author | Martin Czygan <martin.czygan@gmail.com> | 2021-09-24 13:58:51 +0200 |
---|---|---|
committer | Martin Czygan <martin.czygan@gmail.com> | 2021-09-24 13:58:51 +0200 |
commit | 478d7d06ad9e56145cb94f3461c355b1ba9eb491 (patch) | |
tree | fa467290e8c8df41a1e97a6de751d0f7e790c9de /extra/grobid_references | |
parent | 86cc3191ce03042ef4a0c6c8a44f4094a140b802 (diff) | |
start larger refactoring: remove cluster
Background: verifying hundreds of millions of documents turned out to
be a bit slow. Anecdata: running clustering and verification over 1.8B
inputs took over 50h, whereas the Go port (skate) needed about 2-4h for
the same operations. Also, with Go we do not need the extra GNU
parallel wrapping.
In any case, the fuzzycat refactoring aims to provide:
* better, more configurable verification and small-scale matching
* removal of the batch clustering code (and improved refcat docs)
* a place for somewhat more generic, similarity-based utils
The most important piece in fuzzycat is a CSV file containing
hand-picked test examples for verification, together with the code that
passes that test suite. We want to make this part more robust.
Diffstat (limited to 'extra/grobid_references')
0 files changed, 0 insertions, 0 deletions