path: root/tests
Commit message [Author, Age, Files, Lines]
* complete FuzzyReleaseMatcher refactoring [Martin Czygan, 2021-12-06, 11 files, -84/+437]
|   We keep the name, since the API - "matcher.match(release)" - is the same;
|   simplified queries; at most one query is performed against elasticsearch;
|   parallel release retrieval from the API; optional support for release year
|   windows.
|
|   Test cases are expressed in YAML and will be auto-loaded from the specified
|   directory; the tests work against the current search endpoint, which means
|   the actual output may change on index updates; for the moment, we think
|   this setup is relatively simple and not too unstable.
|
|       about: title contrib, partial name
|       input: >
|         {
|           "contribs": [
|             {
|               "raw_name": "Adams"
|             }
|           ],
|           "title": "digital libraries",
|           "ext_ids": {}
|         }
|       release_year_padding: 1
|       expected:
|         - 7rmvqtrb2jdyhcxxodihzzcugy
|         - a2u6ougtsjcbvczou6sazsulcm
|         - dy45vilej5diros6zmax46nm4e
|         - exuwhhayird4fdjmmsiqpponlq
|         - gqrj7jikezgcfpjfazhpf4e7c4
|         - mkmqt3453relbpuyktnmsg6hjq
|         - t2g5sl3dgzchtnq7dynxyzje44
|         - t4tvenhrvzamraxrvvxivxmvga
|         - wd3oeoi3bffknfbg2ymleqc4ja
|         - y63a6dhrfnb7bltlxfynydbojy
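The `release_year_padding: 1` field in the test case above suggests a year window around a known release year. The following is only a minimal sketch of that idea, assuming the window is symmetric and inclusive; `year_window` and `in_window` are hypothetical helper names and not part of fuzzycat's API.

```python
from typing import Optional, Tuple


def year_window(release_year: Optional[int], padding: int = 1) -> Optional[Tuple[int, int]]:
    """Return an inclusive (low, high) year range, or None if no year is known.

    Assumption: the window is symmetric around the release year.
    """
    if release_year is None:
        return None
    if padding < 0:
        raise ValueError("padding must be non-negative")
    return (release_year - padding, release_year + padding)


def in_window(candidate_year: Optional[int], window: Optional[Tuple[int, int]]) -> bool:
    """Accept a candidate when no window applies or its year falls inside it."""
    if window is None or candidate_year is None:
        # Assumption: missing year information does not exclude a candidate.
        return True
    low, high = window
    return low <= candidate_year <= high
```

With a padding of 1, a release from 2000 would accept candidates from 1999 through 2001 under these assumptions.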
* complete migration away from match_release_fuzzy [Martin Czygan, 2021-11-16, 1 file, -81/+1]
|   Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same
|   behavior.
* turn "match_release_fuzzy" into a class [Martin Czygan, 2021-11-16, 16 files, -12/+323]
|   The goal of this refactoring was to make the matching process a bit more
|   configurable by using a class and a cascade of queries. For a limited test
|   set, `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`.
* use grobid_tei_xml for grobid unstructured lookups [Bryan Newbold, 2021-10-28, 1 file, -26/+32]
|
* start larger refactoring: remove cluster [Martin Czygan, 2021-09-24, 3 files, -199/+12]
|   Background: verifying hundreds of millions of documents turned out to be a
|   bit slow; anecdata: running clustering and verification over 1.8B inputs
|   took over 50h, whereas the Go port (skate) required about 2-4h for those
|   operations. Also: with Go we do not need the extra GNU parallel wrapping.
|
|   In any case, we aim for the fuzzycat refactoring to provide:
|
|   * better, more configurable verification and small scale matching
|   * removal of batch clustering code (and improve refcat docs)
|   * a place for a bit more generic, similarity based utils
|
|   The most important piece in fuzzycat is a CSV file containing hand picked
|   test examples for verification - and the code that is able to fulfill that
|   test suite. We want to make this part more robust.
* tests: temporarily disable tests [Martin Czygan, 2021-09-21, 1 file, -12/+12]
|   We want to first move to elasticsearch dsl and will reactivate and extend
|   the tests after refactoring.
* matching: run an additional es query for fuzzy matching [Martin Czygan, 2021-09-21, 1 file, -2/+20]
|
* style: apply formatting [Martin Czygan, 2021-09-21, 2 files, -3/+13]
|
* cluster: adjust tests to jellyfish nysiis implementation [Martin Czygan, 2021-09-13, 1 file, -7/+7]
|
* Merge branch 'master' of git.archive.org:webgroup/fuzzycat [Martin Czygan, 2021-07-09, 1 file, -3/+21]
|\
| |   * 'master' of git.archive.org:webgroup/fuzzycat:
| |     simplify README for general audience; move some content to notes
| |     sandcrawler slugify: lower-case greek ambiguity (OCR)
| |     DOI clean/normalize helper; and use in verification etc
| |     verify: page count parsing and comparison improvements
| * DOI clean/normalize helper; and use in verification etc [Bryan Newbold, 2021-07-01, 1 file, -1/+14]
| |
| * verify: page count parsing and comparison improvements [Bryan Newbold, 2021-07-01, 1 file, -2/+7]
| |
* | add a few (open) test cases [Martin Czygan, 2021-07-09, 6 files, -0/+176]
|/
* add test case [Martin Czygan, 2021-06-21, 4 files, -0/+1339]
|
* lint: remove unused imports [Bryan Newbold, 2021-05-31, 1 file, -1/+0]
|
* add test case [Martin Czygan, 2021-05-26, 3 files, -0/+83]
|
* add test [Martin Czygan, 2021-05-12, 3 files, -0/+603]
|
* add test cases [Martin Czygan, 2021-05-06, 10 files, -0/+1861]
|
* add test case [Martin Czygan, 2021-04-20, 3 files, -0/+107]
|
* add test [Martin Czygan, 2021-04-17, 3 files, -0/+1982]
|
* Merge branch 'bnewbold-upstreaming' into 'master' [Martin Czygan, 2021-04-15, 2 files, -0/+172]
|\
| |   refactoring/upstreaming fuzzycat "live" matching helpers
| |
| |   See merge request webgroup/fuzzycat!2
| * add 'simple' high-level routines for fuzzy-match-and-verify calls [Bryan Newbold, 2021-04-14, 1 file, -0/+42]
| |   Some of these are a little redundant, in that calling code could
| |   trivially re-implement them. However, I think these are good starters for
| |   stable external API interfaces, leaving us room to iterate and refactor
| |   lower-level implementations behind the scenes.
| * GROBID API unstructured citation parsing utility code [Bryan Newbold, 2021-04-14, 1 file, -0/+130]
| |
* | cleanup merge artifact [Martin Czygan, 2021-04-15, 1 file, -1/+0]
| |
* | Merge branch 'bnewbold-dev-setup' [Martin Czygan, 2021-04-15, 1 file, -1/+8]
|\|
| |   * bnewbold-dev-setup:
| |     dynaconf: switch to fuzzycat.config import across project
| |     upgrade to python3.8
| |     gitlab CI: try 'make deps' and 'make test'
| |     makefile: run common commands inside pipenv
| |     makefile: change 'deps' to be simple --dev --deploy
| |     make fmt
| * dynaconf: switch to fuzzycat.config import across project [Bryan Newbold, 2021-04-13, 1 file, -2/+1]
| |   This is the recommended way to use dynaconf.
| * make fmt [Bryan Newbold, 2021-04-13, 1 file, -3/+14]
| |
* | fix imports and formatting [Martin Czygan, 2021-04-14, 2 files, -8/+26]
| |
* | test: skip if configured search server is not reachable [Martin Czygan, 2021-04-14, 1 file, -0/+14]
| |
* | tests: run es tests against public search endpoint [Martin Czygan, 2021-04-14, 1 file, -8/+31]
|/
* address es hits.total change in ES7 [Martin Czygan, 2021-04-12, 1 file, -1/+10]
|   * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html
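For context on the change referenced above: from Elasticsearch 7.0 on, `hits.total` in a search response is an object with `value` and `relation` keys rather than a plain integer. A minimal sketch of a helper that tolerates both shapes follows; `hits_total` is a hypothetical name and not necessarily how fuzzycat handles it.

```python
from typing import Any, Dict


def hits_total(response: Dict[str, Any]) -> int:
    """Return the total hit count from a search response dict.

    Handles the pre-7.0 integer form and the 7.x object form
    ({"value": N, "relation": "eq" | "gte"}).
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # ES7+: the count lives under "value"; with relation "gte" it is
        # only a lower bound.
        return int(total["value"])
    return int(total)
```

Callers that need an exact count may also want to inspect `relation`, since `"gte"` marks the value as a lower bound.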
* add compress kwarg to cluster [Martin Czygan, 2021-02-02, 8 files, -8/+31]
|   Will compress intermediate results with zstd (https://git.io/Jt00y9).
* add case [Martin Czygan, 2021-02-01, 3 files, -0/+171]
|
* add case; [Martin Czygan, 2021-01-29, 3 files, -0/+132]
|   different, but related; verify says: "strong"
* add case [Martin Czygan, 2021-01-28, 5 files, -0/+710]
|
* add case [Martin Czygan, 2021-01-27, 3 files, -0/+440]
|
* add case [Martin Czygan, 2021-01-27, 3 files, -0/+268]
|   same DOI, but repeated slash
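The "same DOI, but repeated slash" case can be illustrated with a small normalization step applied before two DOIs are compared. This is only a sketch under stated assumptions: `normalize_doi` is a hypothetical helper, and collapsing repeated slashes is an assumed cleanup rule (a DOI suffix may in principle contain "//"), not a description of fuzzycat's actual DOI helper.

```python
import re
from typing import Optional

# Common resolver and scheme prefixes that may precede a bare DOI.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> Optional[str]:
    """Return a lower-cased DOI with repeated slashes collapsed, or None."""
    doi = raw.strip().lower()
    # Strip the prefix first, so the "//" of a URL scheme is not touched.
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Assumption: runs of slashes are artifacts and can be collapsed.
    doi = re.sub(r"/{2,}", "/", doi)
    if not re.fullmatch(r"10\.\d+(?:\.\d+)*/\S+", doi):
        return None
    return doi
```

Under these assumptions, two records whose DOIs differ only by a doubled slash normalize to the same string and can be compared directly.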
* add case [Martin Czygan, 2021-01-26, 3 files, -0/+156]
|
* add case; probably similar but yields different [Martin Czygan, 2021-01-23, 3 files, -0/+87]
|
* add case [Martin Czygan, 2021-01-20, 3 files, -0/+139]
|
* add case [Martin Czygan, 2021-01-15, 3 files, -0/+786]
|
* add case [Martin Czygan, 2021-01-15, 3 files, -0/+422]
|
* add case; article republished [Martin Czygan, 2021-01-15, 3 files, -0/+79]
|   Unfortunately, the metadata is partial regarding the page count (e.g. "29"
|   in the metadata, but "29-45" on the publisher site:
|   https://academic.oup.com/restud/article-abstract/41/5/29/1522050)
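The "29" versus "29-45" example shows why a page count can only be derived when both ends of a range are present. A minimal sketch of such parsing, assuming a plain `first-last` numeric range; `page_count` is a hypothetical helper and does not reproduce fuzzycat's verification code.

```python
import re
from typing import Optional


def page_count(pages: Optional[str]) -> Optional[int]:
    """Return the number of pages for a "first-last" range, else None.

    A lone start page such as "29" carries no length information, so the
    result is None rather than a guess.
    """
    if not pages:
        return None
    # Accept an ASCII hyphen or an en dash between the two page numbers.
    match = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", pages)
    if not match:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        # Abbreviated ranges like "123-45" are not interpreted here.
        return None
    return last - first + 1
```

A comparison step could then treat a `None` on either side as "unknown" instead of as a mismatch, which is one way to avoid penalizing partial metadata.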
* add case (article, comment) [Martin Czygan, 2021-01-14, 3 files, -0/+75]
|   currently, status is STRONG; having article and comments attached to a
|   single work item might be useful
* case: translation in title [Martin Czygan, 2021-01-08, 5 files, -0/+177]
|
* add cases [Martin Czygan, 2021-01-04, 3 files, -3/+364]
|
* add test case [Martin Czygan, 2021-01-04, 3 files, -0/+93]
|
* fix cases [Martin Czygan, 2020-12-29, 1 file, -9/+9]
|
* add cases for a couple of reviews [Martin Czygan, 2020-12-29, 11 files, -0/+441]
|
* case: doi typo [Martin Czygan, 2020-12-24, 3 files, -0/+111]
|