fuzzycat/tests, branch master

https://git.bnewbold.net/fuzzycat/atom?h=master
updated: 2021-12-21T19:56:56Z

apply first round of feedback on matching
updated: 2021-12-21T19:56:56Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-12-17T09:07:15Z
id: urn:sha1:de9f1155ea57c812171abd5517ab39f4fe135cb3

matching: cleanup test files
updated: 2021-12-06T18:59:51Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-12-06T18:59:51Z
id: urn:sha1:5bd8ee08a3e0f52893c1b7afa6bc4f062b7c062c

complete FuzzyReleaseMatcher refactoring
updated: 2021-12-06T18:53:30Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-11-17T13:51:50Z
id: urn:sha1:dd6149140542585f2b0bfc3b334ec2b0a88b790e

    We keep the name, since the api - "matcher.match(release)" - is the
    same; simplified queries; at most one query is performed against
    elasticsearch; parallel release retrieval from the API; optional
    support for release year windows.

    Test cases are expressed in yaml and will be auto-loaded from the
    specified directory. Tests work against the current search endpoint,
    which means the actual output may change on index updates; for the
    moment, we think this setup is relatively simple and not too unstable.

        about: title contrib, partial name
        input: >
          {
            "contribs": [
              {
                "raw_name": "Adams"
              }
            ],
            "title": "digital libraries",
            "ext_ids": {}
          }
        release_year_padding: 1
        expected:
          - 7rmvqtrb2jdyhcxxodihzzcugy
          - a2u6ougtsjcbvczou6sazsulcm
          - dy45vilej5diros6zmax46nm4e
          - exuwhhayird4fdjmmsiqpponlq
          - gqrj7jikezgcfpjfazhpf4e7c4
          - mkmqt3453relbpuyktnmsg6hjq
          - t2g5sl3dgzchtnq7dynxyzje44
          - t4tvenhrvzamraxrvvxivxmvga
          - wd3oeoi3bffknfbg2ymleqc4ja
          - y63a6dhrfnb7bltlxfynydbojy

complete migration away from match_release_fuzzy
updated: 2021-11-16T20:13:46Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-11-16T20:13:46Z
id: urn:sha1:d104f8d0ba8eef5563555de82be66bbf17f961db

    Instead, use `FuzzyReleaseMatcher.match`, which has approximately the
    same behavior.
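
As an illustration of the yaml test case format described in the "complete FuzzyReleaseMatcher refactoring" entry, the following is a minimal sketch of how one such case could be checked. It is not the fuzzycat test loader. The helper names `parse_case` and `check_case` are hypothetical, the case is assumed to be already parsed from yaml into a dict (for example with PyYAML's `yaml.safe_load`), the default padding of 0 is an assumption, and `match_fn` is a stand-in for `matcher.match` that is assumed to return release identifiers.

```python
import json


def parse_case(doc):
    """Normalize one already-parsed yaml test case (a dict).

    Hypothetical helper: field names follow the example in the commit
    message; the default padding of 0 is an assumption.
    """
    return {
        "about": doc.get("about", ""),
        # "input" is a folded yaml string that holds a JSON-encoded release
        "release": json.loads(doc["input"]),
        "release_year_padding": doc.get("release_year_padding", 0),
        "expected": set(doc.get("expected", [])),
    }


def check_case(case, match_fn):
    """Return (missing, unexpected) identifiers for one test case.

    match_fn is a hypothetical stand-in for matcher.match; here it is
    assumed to take the release dict and return release identifiers.
    """
    actual = set(match_fn(case["release"]))
    return case["expected"] - actual, actual - case["expected"]


if __name__ == "__main__":
    doc = {
        "about": "title contrib, partial name",
        "input": '{"contribs": [{"raw_name": "Adams"}], '
                 '"title": "digital libraries", "ext_ids": {}}',
        "release_year_padding": 1,
        "expected": ["7rmvqtrb2jdyhcxxodihzzcugy", "a2u6ougtsjcbvczou6sazsulcm"],
    }
    case = parse_case(doc)
    # Stub matcher: one expected identifier plus one made-up identifier.
    missing, unexpected = check_case(
        case, lambda release: ["7rmvqtrb2jdyhcxxodihzzcugy", "not-a-real-ident"]
    )
    print(sorted(missing), sorted(unexpected))
```

Because the tests run against a live search endpoint, reporting missing and unexpected identifiers separately may make it easier to tell an index update apart from a regression in the matcher.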

turn "match_release_fuzzy" into a class
updated: 2021-11-16T17:58:42Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-11-05T16:19:07Z
id: urn:sha1:0c84af603894049dd8edd95da18d8990ab0516d1

    Goal of this refactoring was to make the matching process a bit more
    configurable by using a class and a cascade of queries.

    For a limited test set: `FuzzyReleaseMatcher.match` works the same as
    `match_release_fuzzy`.

use grobid_tei_xml for grobid unstructured lookups
updated: 2021-10-28T21:00:49Z
author: Bryan Newbold <bnewbold@archive.org>
published: 2021-10-28T21:00:36Z
id: urn:sha1:2f41335d268b0e2705a1ebff0ff104e965630837

start larger refactoring: remove cluster
updated: 2021-09-24T11:58:51Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-09-24T11:58:51Z
id: urn:sha1:478d7d06ad9e56145cb94f3461c355b1ba9eb491

    background: verifying hundreds of millions of documents turned out to
    be a bit slow; anecdata: running clustering and verification over 1.8B
    inputs took over 50h; cf. the Go port (skate), which required about
    2-4h for those operations. Also: with Go we do not need the extra GNU
    parallel wrapping.

    In any case, we aim for the fuzzycat refactoring to provide:

    * better, more configurable verification and small scale matching
    * removal of batch clustering code (and improved refcat docs)
    * a place for a bit more generic, similarity based utils

    The most important piece in fuzzycat is a CSV file containing hand
    picked test examples for verification - and the code that is able to
    fulfill that test suite. We want to make this part more robust.

tests: temporarily disable tests
updated: 2021-09-21T14:36:55Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-09-21T14:36:55Z
id: urn:sha1:5fa61d89320af880d5bf6b3231f6478887cfb6a6

    We want to first move to elasticsearch dsl and will reactivate and
    extend the tests after refactoring.
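
The "cascade of queries" mentioned in the "turn match_release_fuzzy into a class" entry can be pictured roughly as follows. This is a sketch under assumptions, not the fuzzycat implementation: the class name `CascadeMatcher`, the query builders, and the dict-shaped queries are all hypothetical, and the cascade is assumed to mean "try stricter queries first and stop at the first one that returns hits". The search backend is passed in as a plain callable, so no elasticsearch client is involved.

```python
class CascadeMatcher:
    """Hypothetical illustration of a query cascade (not fuzzycat code)."""

    def __init__(self, search_fn, query_builders):
        # search_fn: callable taking a query and returning a list of hits
        # query_builders: callables mapping a release dict to a query or None
        self.search_fn = search_fn
        self.query_builders = list(query_builders)

    def match(self, release):
        """Return the hits of the first query that yields any, else []."""
        for build in self.query_builders:
            query = build(release)
            if query is None:
                continue  # this builder does not apply to the input
            hits = self.search_fn(query)
            if hits:
                return hits
        return []


def by_title_and_contrib(release):
    """Stricter query: needs a title and at least one contributor name."""
    title = release.get("title")
    names = [c["raw_name"] for c in release.get("contribs", []) if c.get("raw_name")]
    if not title or not names:
        return None
    return {"title": title, "contrib": names[0]}


def by_title(release):
    """Looser fallback query: title only."""
    title = release.get("title")
    return {"title": title} if title else None


if __name__ == "__main__":
    # Fake backend: only the title-only query finds anything.
    def fake_search(query):
        return ["hypothetical-ident"] if "contrib" not in query else []

    matcher = CascadeMatcher(fake_search, [by_title_and_contrib, by_title])
    release = {"title": "digital libraries", "contribs": [{"raw_name": "Adams"}]}
    print(matcher.match(release))
```

Keeping the builders as plain functions in an ordered list is one way to make the matching process configurable: callers can reorder, drop, or add queries without touching the class.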

matching: run an additional es query for fuzzy matching
updated: 2021-09-21T13:55:52Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-09-21T13:55:52Z
id: urn:sha1:dccbaa5c1b0ba556449de6024540ba05d67ef6a0

style: apply formatting
updated: 2021-09-21T13:54:46Z
author: Martin Czygan <martin.czygan@gmail.com>
published: 2021-09-21T13:54:46Z
id: urn:sha1:08a9242e2ed19aaec14d92fe174bee21bb4232eb