fuzzycat

	Commit message	Author	Age	Files	Lines
*	apply first round of feedback on matching (HEAD, master)	Martin Czygan	2021-12-21	13	-9/+73
\|
*	matching: track_total_hits, use False	Martin Czygan	2021-12-16	1	-4/+4
\|	an integer value, despite being supported according to the docs, yielded a 400 parse error
*	matching: we do not need exact match counts	Martin Czygan	2021-12-16	1	-4/+4
\|	up to 100 or even will be ok; see also:
\|	https://www.elastic.co/guide/en/elasticsearch/reference/7.16/search-your-data.html#track-total-hits
*	matching: add hdl, remove mag id	Martin Czygan	2021-12-16	1	-1/+1
\|
*	matching: cleanup and documentation	Martin Czygan	2021-12-07	1	-47/+29
\|
*	matching: update docs	Martin Czygan	2021-12-07	1	-9/+8
\|
*	v0.1.23	Martin Czygan	2021-12-06	1	-1/+1
\|
*	matching: cleanup test files	Martin Czygan	2021-12-06	24	-202/+1
\|
*	complete FuzzyReleaseMatcher refactoring	Martin Czygan	2021-12-06	14	-362/+644
\|	We keep the name, since the api - "matcher.match(release)" - is the same;
\|	simplified queries; at most one query is performed against elasticsearch;
\|	parallel release retrieval from the API; optional support for release year
\|	windows.
\|	Test cases are expressed in yaml and will be auto-loaded from the specified
\|	directory; tests work against the current search endpoint, which means the
\|	actual output may change on index updates; for the moment, we think this
\|	setup is relatively simple and not too unstable.
\|	    about: title contrib, partial name
\|	    input: >
\|	      { "contribs": [ { "raw_name": "Adams" } ], "title": "digital libraries", "ext_ids": {} }
\|	    release_year_padding: 1
\|	    expected:
\|	      - 7rmvqtrb2jdyhcxxodihzzcugy
\|	      - a2u6ougtsjcbvczou6sazsulcm
\|	      - dy45vilej5diros6zmax46nm4e
\|	      - exuwhhayird4fdjmmsiqpponlq
\|	      - gqrj7jikezgcfpjfazhpf4e7c4
\|	      - mkmqt3453relbpuyktnmsg6hjq
\|	      - t2g5sl3dgzchtnq7dynxyzje44
\|	      - t4tvenhrvzamraxrvvxivxmvga
\|	      - wd3oeoi3bffknfbg2ymleqc4ja
\|	      - y63a6dhrfnb7bltlxfynydbojy
*	complete migration away from match_release_fuzzy	Martin Czygan	2021-11-16	4	-247/+7
\|	Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same behavior.
*	update todo	Martin Czygan	2021-11-16	1	-1/+2
\|
*	Merge branch 'martin-matcher-class' into 'master'	Martin Czygan	2021-11-16	24	-112/+1410
\|\	turn "match_release_fuzzy" into a class
\| \|	See merge request webgroup/fuzzycat!10
\| *	use elasticsearch <7.14 search args	Martin Czygan	2021-11-16	1	-11/+47
\| \|
\| *	setup: add missing pyyaml dependency	Martin Czygan	2021-11-16	1	-0/+1
\| \|
\| *	setup: add thefuzz dependency	Martin Czygan	2021-11-16	1	-1/+2
\| \|
\| *	turn "match_release_fuzzy" into a class	Martin Czygan	2021-11-16	23	-111/+1371
\|/	Goal of this refactoring was to make the matching process a bit more
\|	configurable by using a class and a cascade of queries. For a limited test
\|	set, `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`.
*	Merge branch 'bnewbold-grobid-tei-xml' into 'master'	Martin Czygan	2021-11-04	4	-277/+56
\|\	use grobid_tei_xml for grobid unstructured lookups
\| \|	See merge request webgroup/fuzzycat!9
\| *	use grobid_tei_xml for grobid unstructured lookups	Bryan Newbold	2021-10-28	4	-277/+56
\|/
*	Merge branch 'bnewbold-tweaks' into 'master'	Martin Czygan	2021-10-28	3	-3/+5
\|\	tweaks to deps and packaging; add files,contribs in live match release lookups
\| \|	See merge request webgroup/fuzzycat!8
\| *	bump fatcat-openapi-client version to 0.4.0	Bryan Newbold	2021-10-27	1	-1/+1
\| \|	There isn't any new feature required in the new version of the client
\| \|	library, but it feels like we should aggressively update everywhere when
\| \|	possible.
\| *	matching: include contribs,files in release entity	Bryan Newbold	2021-10-27	1	-1/+1
\| \|	This makes several downstream applications simpler, like showing PDF links
\| \|	without an additional fatcat API fetch. The 'contrib' entities may be
\| \|	required as part of bibliographic matching (checking the creator names as
\| \|	well as the release-local versions of the name).
\| \|	In theory we could add webcaptures,filesets as well, but those are still
\| \|	rare, and occasionally result in very large sub-documents.
\| *	packaging: include py.typed for mypy to detect	Bryan Newbold	2021-10-27	2	-0/+1
\| \|
\| *	deps: pin elasticsearch to less than 7.14	Bryan Newbold	2021-10-27	1	-1/+2
\|/	This is to avoid 'elasticsearch.exceptions.UnsupportedProductError' errors
\|	in newer versions of the elasticsearch client libraries.
*	start larger refactoring: remove cluster	Martin Czygan	2021-09-24	9	-723/+188
\|	background: verifying hundreds of millions of documents turned out to be a
\|	bit slow; anecdata: running clustering and verification over 1.8B inputs
\|	took over 50h; cf. the Go port (skate) required about 2-4h for those
\|	operations. Also: with Go we do not need the extra GNU parallel wrapping.
\|	In any case, we aim for fuzzycat refactoring to provide:
\|	  * better, more configurable verification and small scale matching
\|	  * removal of batch clustering code (and improve refcat docs)
\|	  * a place for a bit more generic, similarity based utils
\|	The most important piece in fuzzycat is a CSV file containing hand picked
\|	test examples for verification - and the code that is able to fulfill that
\|	test suite. We want to make this part more robust.
*	setup: narrow dependency versions	Martin Czygan	2021-09-21	1	-4/+4
\|
*	Merge branch 'wip-martin-review-cleanup' into 'master'	Martin Czygan	2021-09-21	13	-20/+272
\|\	review notes and some cleanup
\| \|	See merge request webgroup/fuzzycat!7
\| *	tests: temporarily disable tests	Martin Czygan	2021-09-21	1	-12/+12
\| \|	We want to first move to elasticsearch dsl and will reactivate and extend
\| \|	the tests after refactoring.
\| *	matching: run an additional es query for fuzzy matching	Martin Czygan	2021-09-21	2	-3/+93
\| \|
\| *	reorganize notes	Martin Czygan	2021-09-21	6	-2/+153
\| \|
\| *	style: apply formatting	Martin Czygan	2021-09-21	7	-11/+22
\|/
*	matching: actually return the specified number of results	Martin Czygan	2021-09-15	1	-2/+2
\|
*	add todo	Martin Czygan	2021-09-14	1	-0/+28
\|
*	remove pipenv related files	Martin Czygan	2021-09-13	5	-979/+25
\|	fuzzycat is mostly a library; the command line tool will switch to a
\|	bundled executable (e.g. via shiv) soon; removed pipenv in order to reduce
\|	confusion about which setup to use; also, pipenv unfortunately can at times
\|	take a while to complete operations
*	v0.1.22	Martin Czygan	2021-09-13	1	-1/+1
\|
*	cluster: adjust tests to jellyfish nysiis implementation	Martin Czygan	2021-09-13	1	-7/+7
\|
*	update README	Martin Czygan	2021-09-13	1	-4/+7
\|
*	remove dependency on fuzzy; use jellyfish	Martin Czygan	2021-09-13	4	-304/+286
\|
*	cleanup makefile	Martin Czygan	2021-09-13	1	-2/+0
\|
*	update mentions of cgraph to refcat	Bryan Newbold	2021-09-10	2	-2/+2
\|
*	Merge branch 'master' of git.archive.org:webgroup/fuzzycat	Martin Czygan	2021-07-09	8	-224/+318
\|\	* 'master' of git.archive.org:webgroup/fuzzycat:
\| \|	  simplify README for general audience; move some content to notes
\| \|	  sandcrawler slugify: lower-case greek ambiguity (OCR)
\| \|	  DOI clean/normalize helper; and use in verification etc
\| \|	  verify: page count parsing and comparison improvements
\| *	Merge branch 'bnewbold-readme' into 'master'	Martin Czygan	2021-07-07	2	-210/+245
\| \|\	simplify README for general audience; move some content to notes
\| \| \|	See merge request webgroup/fuzzycat!6
\| \| *	simplify README for general audience; move some content to notes	Bryan Newbold	2021-07-01	2	-210/+245
\| \| \|
\| * \|	Merge branch 'bnewbold-verify-improvements' into 'master'	Martin Czygan	2021-07-02	6	-14/+73
\| \|\ \	verify improvements
\| \| \|	See merge request webgroup/fuzzycat!4
\| \| *	sandcrawler slugify: lower-case greek ambiguity (OCR)	Bryan Newbold	2021-07-01	1	-2/+13
\| \| \|
\| \| *	DOI clean/normalize helper; and use in verification etc	Bryan Newbold	2021-07-01	5	-6/+35
\| \| \|
\| \| *	verify: page count parsing and comparison improvements	Bryan Newbold	2021-07-01	3	-6/+25
\| \|/
* \|	add a few (open) test cases	Martin Czygan	2021-07-09	6	-0/+176
\| \|
* \|	notes on matching metrics	Martin Czygan	2021-07-08	1	-0/+16
\| \|
* \|	cleanup notes	Martin Czygan	2021-07-08	2	-13/+0
\|/
*	add test case	Martin Czygan	2021-06-21	4	-0/+1339
\|