fuzzycat

	Commit message	Author	Age	Files	Lines
*	complete migration away from match_release_fuzzy	Martin Czygan	2021-11-16	1	-2/+3
\|	Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same behavior.
*	start larger refactoring: remove cluster	Martin Czygan	2021-09-24	1	-55/+0
\|	Background: verifying hundreds of millions of documents turned out to be a bit slow. Anecdata: running clustering and verification over 1.8B inputs took over 50h, whereas the Go port (skate) required about 2-4h for those operations. Also, with Go we do not need the extra GNU parallel wrapping.
\|	In any case, we aim for the fuzzycat refactoring to provide:
\|	* better, more configurable verification and small-scale matching
\|	* removal of batch clustering code (and improved refcat docs)
\|	* a place for somewhat more generic, similarity-based utils
\|	The most important piece in fuzzycat is a CSV file containing hand-picked test examples for verification, and the code that is able to fulfill that test suite. We want to make this part more robust.
*	lint: remove unused imports	Bryan Newbold	2021-05-31	1	-1/+0
\|
*	main: 'unstructured' CLI demo	Bryan Newbold	2021-04-14	1	-1/+38
\|
*	update notes	Martin Czygan	2021-02-11	1	-4/+10
\|
*	add a batch verifier for ref groups	Martin Czygan	2021-02-11	1	-0/+9
\|
*	add shellout helper	Martin Czygan	2021-02-02	1	-1/+4
\|
*	add -C flag for compression	Martin Czygan	2021-02-02	1	-0/+2
\|
*	update docs	Martin Czygan	2020-12-17	1	-0/+1
\|
*	add flags	Martin Czygan	2020-12-17	1	-1/+9
\|
*	fix name	Martin Czygan	2020-12-16	1	-1/+1
\|
*	add missing function	Martin Czygan	2020-12-16	1	-1/+1
\|
*	update reference	Martin Czygan	2020-12-16	1	-1/+1
\|
*	add todo	Martin Czygan	2020-12-16	1	-0/+3
\|
*	docs and release match command	Martin Czygan	2020-12-16	1	-13/+118
\|
*	cleanup	Martin Czygan	2020-12-15	1	-1/+3
\|
*	fix cmdline tool	Martin Czygan	2020-12-15	1	-3/+11
\|
*	fix verification	Martin Czygan	2020-12-15	1	-1/+1
\|
*	single item verification	Martin Czygan	2020-12-15	1	-1/+40
\|
*	fix param, return	Martin Czygan	2020-11-28	1	-2/+3
\|
*	add --min-cluster-size flag to cluster subcommand	Martin Czygan	2020-11-26	1	-0/+4
\|
*	wip: verification	Martin Czygan	2020-11-13	1	-1/+3
\|	Output currently (1m sample): { "unique": 916075, "too_large": 575, "dummy": 10307, "contrib_miss": 27215, "short_title": 1379, "arxiv_v": 8943 }
*	Merge branch 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat into bnewbold-bnewbold-sandcrawler	Martin Czygan	2020-11-12	1	-27/+10
\|	* 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat:
\|	  sandcrawler slugify: yet more unicode corner-cases
\|	  add sandcrawler-style title key method
\|	  cluster: count empty keys (and don't return them)
\|	  pipenv: explicit regex dependency
\|	  gitignore: add .swp (vim)
\|	  make: run pytest over fuzzycat/ to catch inline tests
\|	  add support for key denylist
*	move fileinput.input out of the cluster	Martin Czygan	2020-11-12	1	-3/+3
\|	The cluster class should work with any iterable, so testing will be easier.
*	move main.py to __main__.py	Martin Czygan	2020-11-12	1	-0/+121