path: root/fuzzycat/__main__.py
Commit message | Author | Age | Files | Lines
* complete migration away from match_release_fuzzy | Martin Czygan | 2021-11-16 | 1 | -2/+3
    Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same behavior.
* start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 1 | -55/+0
    background: verifying hundreds of millions of documents turned out to be a bit slow; anecdata: running clustering and verification over 1.8B inputs took over 50h; cf. the Go port (skate) required about 2-4h for those operations. Also: with Go we do not need the extra GNU parallel wrapping.

    In any case, we aim for fuzzycat refactoring to provide:

    * better, more configurable verification and small scale matching
    * removal of batch clustering code (and improve refcat docs)
    * a place for a bit more generic, similarity based utils

    The most important piece in fuzzycat is a CSV file containing hand picked test examples for verification, and the code that is able to fulfill that test suite. We want to make this part more robust.
* lint: remove unused imports | Bryan Newbold | 2021-05-31 | 1 | -1/+0
* main: 'unstructured' CLI demo | Bryan Newbold | 2021-04-14 | 1 | -1/+38
* update notes | Martin Czygan | 2021-02-11 | 1 | -4/+10
* add a batch verifier for ref groups | Martin Czygan | 2021-02-11 | 1 | -0/+9
* add shellout helper | Martin Czygan | 2021-02-02 | 1 | -1/+4
* add -C flag for compression | Martin Czygan | 2021-02-02 | 1 | -0/+2
* update docs | Martin Czygan | 2020-12-17 | 1 | -0/+1
* add flags | Martin Czygan | 2020-12-17 | 1 | -1/+9
* fix name | Martin Czygan | 2020-12-16 | 1 | -1/+1
* add missing function | Martin Czygan | 2020-12-16 | 1 | -1/+1
* update reference | Martin Czygan | 2020-12-16 | 1 | -1/+1
* add todo | Martin Czygan | 2020-12-16 | 1 | -0/+3
* docs and release match command | Martin Czygan | 2020-12-16 | 1 | -13/+118
* cleanup | Martin Czygan | 2020-12-15 | 1 | -1/+3
* fix cmdline tool | Martin Czygan | 2020-12-15 | 1 | -3/+11
* fix verification | Martin Czygan | 2020-12-15 | 1 | -1/+1
* single item verification | Martin Czygan | 2020-12-15 | 1 | -1/+40
* fix param, return | Martin Czygan | 2020-11-28 | 1 | -2/+3
* add --min-cluster-size flag to cluster subcommand | Martin Czygan | 2020-11-26 | 1 | -0/+4
* wip: verification | Martin Czygan | 2020-11-13 | 1 | -1/+3
    Output currently (1m sample): {"unique": 916075, "too_large": 575, "dummy": 10307, "contrib_miss": 27215, "short_title": 1379, "arxiv_v": 8943}
* Merge branch 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat into bnewbold-bnewbold-sandcrawler | Martin Czygan | 2020-11-12 | 1 | -27/+10
    * 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat:
      sandcrawler slugify: yet more unicode corner-cases
      add sandcrawler-style title key method
      cluster: count empty keys (and don't return them)
      pipenv: explicit regex dependency
      gitignore: add .swp (vim)
      make: run pytest over fuzzycat/ to catch inline tests
      add support for key denylist
* move fileinput.input out of the cluster | Martin Czygan | 2020-11-12 | 1 | -3/+3
    The cluster class should work with any iterable, so testing will be easier.
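    A minimal sketch of the idea behind this change, not the code from the commit: a class that takes any iterable of lines can be tested with a plain list or an in-memory buffer, while the command line entry point can still pass `fileinput.input()`. The class name `LineCluster` and the toy key function are assumptions made for illustration only; fuzzycat's actual cluster class is not reproduced here.

```python
import io
from collections import defaultdict
from typing import Dict, Iterable, List


class LineCluster:
    """Hypothetical stand-in for a cluster class; not fuzzycat's implementation."""

    def __init__(self, lines: Iterable[str]):
        # Any iterable of strings is accepted: a list, an open file,
        # io.StringIO, or the object returned by fileinput.input().
        self.lines = lines

    def run(self) -> Dict[str, List[str]]:
        groups: Dict[str, List[str]] = defaultdict(list)
        for line in self.lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Toy key (an assumption): lowercase, keep only alphanumerics.
            key = "".join(c for c in line.lower() if c.isalnum())
            groups[key].append(line)
        return dict(groups)


# In a test, no files or sys.argv handling are needed:
from_list = LineCluster(["Hello World", "hello, world!", "Other"]).run()
from_buffer = LineCluster(io.StringIO("a\nA\n")).run()
```

    In the command line tool, the same class would be constructed as `LineCluster(fileinput.input(files=...))`, so the file handling stays at the edge of the program.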
* move main.py to __main__.py | Martin Czygan | 2020-11-12 | 1 | -0/+121