fuzzycat

	Commit message	Author	Age	Files	Lines
*	matching: cleanup test files	Martin Czygan	2021-12-06	24	-202/+1
\|
*	complete FuzzyReleaseMatcher refactoring	Martin Czygan	2021-12-06	11	-84/+437
\|	We keep the name, since the api - "matcher.match(release)" - is the same; simplified queries; at most one query is performed against elasticsearch; parallel release retrieval from the API; optional support for release year windows.
\|	Test cases are expressed in yaml and will be auto-loaded from the specified directory; tests work against the current search endpoint, which means the actual output may change on index updates; for the moment, we think this setup is relatively simple and not too unstable.
\|	about: title contrib, partial name
\|	input: >
\|	  { "contribs": [ { "raw_name": "Adams" } ], "title": "digital libraries", "ext_ids": {} }
\|	release_year_padding: 1
\|	expected:
\|	  - 7rmvqtrb2jdyhcxxodihzzcugy
\|	  - a2u6ougtsjcbvczou6sazsulcm
\|	  - dy45vilej5diros6zmax46nm4e
\|	  - exuwhhayird4fdjmmsiqpponlq
\|	  - gqrj7jikezgcfpjfazhpf4e7c4
\|	  - mkmqt3453relbpuyktnmsg6hjq
\|	  - t2g5sl3dgzchtnq7dynxyzje44
\|	  - t4tvenhrvzamraxrvvxivxmvga
\|	  - wd3oeoi3bffknfbg2ymleqc4ja
\|	  - y63a6dhrfnb7bltlxfynydbojy
*	complete migration away from match_release_fuzzy	Martin Czygan	2021-11-16	1	-81/+1
\|	Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same behavior.
*	turn "match_release_fuzzy" into a class	Martin Czygan	2021-11-16	16	-12/+323
\|	Goal of this refactoring was to make the matching process a bit more configurable by using a class and a cascade of queries. For a limited test set: `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`.
*	use grobid_tei_xml for grobid unstructured lookups	Bryan Newbold	2021-10-28	1	-26/+32
\|
*	start larger refactoring: remove cluster	Martin Czygan	2021-09-24	3	-199/+12
\|	background: verifying hundreds of millions of documents turned out to be a bit slow; anecdata: running clustering and verification over 1.8B inputs took over 50h; cf. the Go port (skate) required about 2-4h for those operations. Also: with Go we do not need the extra GNU parallel wrapping.
\|	In any case, we aim for fuzzycat refactoring to provide:
\|	* better, more configurable verification and small scale matching
\|	* removal of batch clustering code (and improve refcat docs)
\|	* a place for a bit more generic, similarity based utils
\|	The most important piece in fuzzycat is a CSV file containing hand picked test examples for verification - and the code that is able to fulfill that test suite. We want to make this part more robust.
*	tests: temporarily disable tests	Martin Czygan	2021-09-21	1	-12/+12
\|	We want to first move to elasticsearch dsl and will reactivate and extend the tests after refactoring.
*	matching: run an additional es query for fuzzy matching	Martin Czygan	2021-09-21	1	-2/+20
\|
*	style: apply formatting	Martin Czygan	2021-09-21	2	-3/+13
\|
*	cluster: adjust tests to jellyfish nysiis implementation	Martin Czygan	2021-09-13	1	-7/+7
\|
*	Merge branch 'master' of git.archive.org:webgroup/fuzzycat	Martin Czygan	2021-07-09	1	-3/+21
\|\	* 'master' of git.archive.org:webgroup/fuzzycat: simplify README for general audience; move some content to notes sandcrawler slugify: lower-case greek ambiguity (OCR) DOI clean/normalize helper; and use in verification etc verify: page count parsing and comparison improvements
\| *	DOI clean/normalize helper; and use in verification etc	Bryan Newbold	2021-07-01	1	-1/+14
\| \|
\| *	verify: page count parsing and comparison improvements	Bryan Newbold	2021-07-01	1	-2/+7
\| \|
* \|	add a few (open) test cases	Martin Czygan	2021-07-09	6	-0/+176
\|/
*	add test case	Martin Czygan	2021-06-21	4	-0/+1339
\|
*	lint: remove unused imports	Bryan Newbold	2021-05-31	1	-1/+0
\|
*	add test case	Martin Czygan	2021-05-26	3	-0/+83
\|
*	add test	Martin Czygan	2021-05-12	3	-0/+603
\|
*	add test cases	Martin Czygan	2021-05-06	10	-0/+1861
\|
*	add test case	Martin Czygan	2021-04-20	3	-0/+107
\|
*	add test	Martin Czygan	2021-04-17	3	-0/+1982
\|
*	Merge branch 'bnewbold-upstreaming' into 'master'	Martin Czygan	2021-04-15	2	-0/+172
\|\	refactoring/upstreaming fuzzycat "live" matching helpers
\| \|	See merge request webgroup/fuzzycat!2
\| *	add 'simple' high-level routines for fuzzy-match-and-verify calls	Bryan Newbold	2021-04-14	1	-0/+42
\| \|	Some of these are a little redundant, in that calling code could trivially re-implement them. However, I think these are good starters for stable external API interfaces, leaving us room to iterate and refactor lower-level implementations behind the scenes.
\| *	GROBID API unstructured citation parsing utility code	Bryan Newbold	2021-04-14	1	-0/+130
\| \|
* \|	cleanup merge artifact	Martin Czygan	2021-04-15	1	-1/+0
\| \|
* \|	Merge branch 'bnewbold-dev-setup'	Martin Czygan	2021-04-15	1	-1/+8
\|\	* bnewbold-dev-setup:
\| \|	  dynaconf: switch to fuzzycat.config import across project
\| \|	  upgrade to python3.8
\| \|	  gitlab CI: try 'make deps' and 'make test'
\| \|	  makefile: run common commands inside pipenv
\| \|	  makefile: change 'deps' to be simple --dev --deploy
\| \|	  make fmt
\| *	dynaconf: switch to fuzzycat.config import across project	Bryan Newbold	2021-04-13	1	-2/+1
\| \|	This is the recommended way to use dynaconf.
\| *	make fmt	Bryan Newbold	2021-04-13	1	-3/+14
\| \|
* \|	fix imports and formatting	Martin Czygan	2021-04-14	2	-8/+26
\| \|
* \|	test: skip if configured search server is not reachable	Martin Czygan	2021-04-14	1	-0/+14
\| \|
* \|	tests: run es tests against public search endpoint	Martin Czygan	2021-04-14	1	-8/+31
\|/
*	address es hits.total change in ES7	Martin Czygan	2021-04-12	1	-1/+10
\|	* https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html
*	add compress kwarg to cluster	Martin Czygan	2021-02-02	8	-8/+31
\|	Will compress intermediate results with zstd (https://git.io/Jt00y9).
*	add case	Martin Czygan	2021-02-01	3	-0/+171
\|
*	add case;	Martin Czygan	2021-01-29	3	-0/+132
\|	different, but related; verify says: "strong"
*	add case	Martin Czygan	2021-01-28	5	-0/+710
\|
*	add case	Martin Czygan	2021-01-27	3	-0/+440
\|
*	add case	Martin Czygan	2021-01-27	3	-0/+268
\|	same DOI, but repeated slash
*	add case	Martin Czygan	2021-01-26	3	-0/+156
\|
*	add case; probably similar but yields different	Martin Czygan	2021-01-23	3	-0/+87
\|
*	add case	Martin Czygan	2021-01-20	3	-0/+139
\|
*	add case	Martin Czygan	2021-01-15	3	-0/+786
\|
*	add case	Martin Czygan	2021-01-15	3	-0/+422
\|
*	add case; article republished	Martin Czygan	2021-01-15	3	-0/+79
\|	unfortunately, the metadata (md) is partial with regard to page count (e.g. "29" in md, but "29-45" on publisher site: https://academic.oup.com/restud/article-abstract/41/5/29/1522050)
*	add case (article, comment)	Martin Czygan	2021-01-14	3	-0/+75
\|	currently, status is STRONG; having article and comments attached to a single work item might be useful
*	case: translation in title	Martin Czygan	2021-01-08	5	-0/+177
\|
*	add cases	Martin Czygan	2021-01-04	3	-3/+364
\|
*	add test case	Martin Czygan	2021-01-04	3	-0/+93
\|
*	fix cases	Martin Czygan	2020-12-29	1	-9/+9
\|
*	add cases for a couple of reviews	Martin Czygan	2020-12-29	11	-0/+441
\|