| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Instead, use `FuzzyReleaseMatcher.match`, which has approximately the
same behavior.
| |
Background: verifying hundreds of millions of documents turned out to be
a bit slow. Anecdata: running clustering and verification over 1.8B
inputs took over 50h, whereas the Go port (skate) required about 2-4h for
those operations. Also, with Go we do not need the extra GNU parallel
wrapping.
In any case, we aim for the fuzzycat refactoring to provide:
* better, more configurable verification and small-scale matching
* removal of batch clustering code (and improved refcat docs)
* a place for somewhat more generic, similarity-based utils
The most important piece in fuzzycat is a CSV file containing hand-picked
test examples for verification - and the code that is able to
pass that test suite. We want to make this part more robust.
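As a rough illustration of the CSV-driven test suite described above, here is a minimal sketch. The column names (`title_a`, `title_b`, `expected`), the verdict strings and the `verify` function are all hypothetical stand-ins; the actual CSV layout and verification code in fuzzycat are not shown here and are likely more involved.

```python
import csv
import io

# Hypothetical CSV layout: two titles and the expected verdict.
# The real fuzzycat test file and its columns are assumptions here.
SAMPLE_CSV = """title_a,title_b,expected
Deep Learning,deep  learning,match
Deep Learning,Shallow Parsing,different
"""


def normalize(title):
    """Lowercase and collapse whitespace; a stand-in for real normalization."""
    return " ".join(title.lower().split())


def verify(title_a, title_b):
    """Toy verifier (hypothetical): exact match after normalization."""
    return "match" if normalize(title_a) == normalize(title_b) else "different"


def run_cases(fileobj):
    """Return (row, actual) pairs for every case where the verdict differs."""
    failures = []
    for row in csv.DictReader(fileobj):
        actual = verify(row["title_a"], row["title_b"])
        if actual != row["expected"]:
            failures.append((row, actual))
    return failures


failures = run_cases(io.StringIO(SAMPLE_CSV))
```

Keeping the examples in a data file means a new edge case can be added as one row, without touching the test code.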
|
| |
Output currently (1m sample):
{
  "unique": 916075,
  "too_large": 575,
  "dummy": 10307,
  "contrib_miss": 27215,
  "short_title": 1379,
  "arxiv_v": 8943
}
| |
into bnewbold-bnewbold-sandcrawler
* 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat:
sandcrawler slugify: yet more unicode corner-cases
add sandcrawler-style title key method
cluster: count empty keys (and don't return them)
pipenv: explicit regex dependency
gitignore: add .swp (vim)
make: run pytest over fuzzycat/ to catch inline tests
add support for key denylist
| |
The cluster class should work with any iterable, so testing will be easier.
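A minimal sketch of why accepting any iterable helps testing: the function below groups documents by a key and can be fed a plain list or a generator, with no file handling involved. The names `cluster` and `title_key`, the counter labels and the `denylist` parameter are hypothetical and only loosely modeled on the commit subjects in this log; this is not fuzzycat's actual cluster class.

```python
from collections import Counter, defaultdict


def title_key(doc):
    """Hypothetical key function: lowercase alphanumerics of the title."""
    return "".join(ch for ch in doc.get("title", "").lower() if ch.isalnum())


def cluster(docs, key=title_key, denylist=frozenset()):
    """Group docs from any iterable by key.

    Empty and denylisted keys are counted but not returned.
    """
    counter = Counter()
    groups = defaultdict(list)
    for doc in docs:
        k = key(doc)
        if not k:
            counter["empty_key"] += 1
            continue
        if k in denylist:
            counter["denylisted"] += 1
            continue
        groups[k].append(doc)
    return dict(groups), counter


# In a test, an in-memory list (or a generator) is enough.
docs = [
    {"title": "Deep Learning"},
    {"title": "deep  learning!"},
    {"title": ""},
    {"title": "Index"},
]
groups, counter = cluster(iter(docs), denylist={"index"})
```

Here the two "deep learning" variants end up in one group, while the empty and the denylisted title are only counted.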
|