Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | turn "match_release_fuzzy" into a class | Martin Czygan | 2021-11-16 | 16 | -12/+323 |
| | | | | | | | | The goal of this refactoring was to make the matching process more configurable by using a class and a cascade of queries. For a limited test set, `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`. | ||||
* | use grobid_tei_xml for grobid unstructured lookups | Bryan Newbold | 2021-10-28 | 1 | -26/+32 |
| | |||||
* | start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 3 | -199/+12 |
| | | | | | | | | | | | | | | | | | | background: verifying hundreds of millions of documents turned out to be a bit slow; anecdata: running clustering and verification over 1.8B inputs took over 50h; by comparison, the Go port (skate) required about 2-4h for those operations. Also: with Go we do not need the extra GNU parallel wrapping. In any case, we aim for the fuzzycat refactoring to provide: * better, more configurable verification and small-scale matching * removal of batch clustering code (and improved refcat docs) * a place for a bit more generic, similarity-based utils. The most important piece in fuzzycat is a CSV file containing hand-picked test examples for verification - and the code that is able to fulfill that test suite. We want to make this part more robust. | ||||
* | tests: temporarily disable tests | Martin Czygan | 2021-09-21 | 1 | -12/+12 |
| | | | | | We want to move to elasticsearch dsl first and will reactivate and extend the tests after refactoring. | ||||
* | matching: run an additional es query for fuzzy matching | Martin Czygan | 2021-09-21 | 1 | -2/+20 |
| | |||||
* | style: apply formatting | Martin Czygan | 2021-09-21 | 2 | -3/+13 |
| | |||||
* | cluster: adjust tests to jellyfish nysiis implementation | Martin Czygan | 2021-09-13 | 1 | -7/+7 |
| | |||||
* | Merge branch 'master' of git.archive.org:webgroup/fuzzycat | Martin Czygan | 2021-07-09 | 1 | -3/+21 |
|\ | | | | | | | | | | | | | | * 'master' of git.archive.org:webgroup/fuzzycat: * simplify README for general audience; move some content to notes * sandcrawler slugify: lower-case greek ambiguity (OCR) * DOI clean/normalize helper; and use in verification etc * verify: page count parsing and comparison improvements | ||||
| * | DOI clean/normalize helper; and use in verification etc | Bryan Newbold | 2021-07-01 | 1 | -1/+14 |
| | | |||||
| * | verify: page count parsing and comparison improvements | Bryan Newbold | 2021-07-01 | 1 | -2/+7 |
| | | |||||
* | | add a few (open) test cases | Martin Czygan | 2021-07-09 | 6 | -0/+176 |
|/ | |||||
* | add test case | Martin Czygan | 2021-06-21 | 4 | -0/+1339 |
| | |||||
* | lint: remove unused imports | Bryan Newbold | 2021-05-31 | 1 | -1/+0 |
| | |||||
* | add test case | Martin Czygan | 2021-05-26 | 3 | -0/+83 |
| | |||||
* | add test | Martin Czygan | 2021-05-12 | 3 | -0/+603 |
| | |||||
* | add test cases | Martin Czygan | 2021-05-06 | 10 | -0/+1861 |
| | |||||
* | add test case | Martin Czygan | 2021-04-20 | 3 | -0/+107 |
| | |||||
* | add test | Martin Czygan | 2021-04-17 | 3 | -0/+1982 |
| | |||||
* | Merge branch 'bnewbold-upstreaming' into 'master' | Martin Czygan | 2021-04-15 | 2 | -0/+172 |
|\ | | | | | | | | refactoring/upstreaming fuzzycat "live" matching helpers. See merge request webgroup/fuzzycat!2 | ||||
| * | add 'simple' high-level routines for fuzzy-match-and-verify calls | Bryan Newbold | 2021-04-14 | 1 | -0/+42 |
| | | | | | | | | | | | | | | Some of these are a little redundant, in that calling code could trivially re-implement. However, I think these are good starters for stable external API interfaces, leaving us room to iterate and refactor lower-level implementations behind the scenes. | ||||
| * | GROBID API unstructured citation parsing utility code | Bryan Newbold | 2021-04-14 | 1 | -0/+130 |
| | | |||||
* | | cleanup merge artifact | Martin Czygan | 2021-04-15 | 1 | -1/+0 |
| | | |||||
* | | Merge branch 'bnewbold-dev-setup' | Martin Czygan | 2021-04-15 | 1 | -1/+8 |
|\| | | | | | | | | | | | | | | | | | | * bnewbold-dev-setup: * dynaconf: switch to fuzzycat.config import across project * upgrade to python3.8 * gitlab CI: try 'make deps' and 'make test' * makefile: run common commands inside pipenv * makefile: change 'deps' to be simple --dev --deploy * make fmt | ||||
| * | dynaconf: switch to fuzzycat.config import across project | Bryan Newbold | 2021-04-13 | 1 | -2/+1 |
| | | | | | | | | This is the recommended way to use dynaconf. | ||||
| * | make fmt | Bryan Newbold | 2021-04-13 | 1 | -3/+14 |
| | | |||||
* | | fix imports and formatting | Martin Czygan | 2021-04-14 | 2 | -8/+26 |
| | | |||||
* | | test: skip if configured search server is not reachable | Martin Czygan | 2021-04-14 | 1 | -0/+14 |
| | | |||||
* | | tests: run es tests against public search endpoint | Martin Czygan | 2021-04-14 | 1 | -8/+31 |
|/ | |||||
* | address es hits.total change in ES7 | Martin Czygan | 2021-04-12 | 1 | -1/+10 |
| | | | | * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html | ||||
* | add compress kwarg to cluster | Martin Czygan | 2021-02-02 | 8 | -8/+31 |
| | | | | Will compress intermediate results with zstd (https://git.io/Jt00y9). | ||||
* | add case | Martin Czygan | 2021-02-01 | 3 | -0/+171 |
| | |||||
* | add case; | Martin Czygan | 2021-01-29 | 3 | -0/+132 |
| | | | | different, but related; verify says: "strong" | ||||
* | add case | Martin Czygan | 2021-01-28 | 5 | -0/+710 |
| | |||||
* | add case | Martin Czygan | 2021-01-27 | 3 | -0/+440 |
| | |||||
* | add case | Martin Czygan | 2021-01-27 | 3 | -0/+268 |
| | | | | same DOI, but repeated slash | ||||
* | add case | Martin Czygan | 2021-01-26 | 3 | -0/+156 |
| | |||||
* | add case; probably similar but yields different | Martin Czygan | 2021-01-23 | 3 | -0/+87 |
| | |||||
* | add case | Martin Czygan | 2021-01-20 | 3 | -0/+139 |
| | |||||
* | add case | Martin Czygan | 2021-01-15 | 3 | -0/+786 |
| | |||||
* | add case | Martin Czygan | 2021-01-15 | 3 | -0/+422 |
| | |||||
* | add case; article republished | Martin Czygan | 2021-01-15 | 3 | -0/+79 |
| | | | | | | unfortunately, md is partial as for page count (e.g. "29" in md, but "29-45" on publisher site: https://academic.oup.com/restud/article-abstract/41/5/29/1522050) | ||||
* | add case (article, comment) | Martin Czygan | 2021-01-14 | 3 | -0/+75 |
| | | | | | currently, status is STRONG; having article and comments attached to a single work item might be useful | ||||
* | case: translation in title | Martin Czygan | 2021-01-08 | 5 | -0/+177 |
| | |||||
* | add cases | Martin Czygan | 2021-01-04 | 3 | -3/+364 |
| | |||||
* | add test case | Martin Czygan | 2021-01-04 | 3 | -0/+93 |
| | |||||
* | fix cases | Martin Czygan | 2020-12-29 | 1 | -9/+9 |
| | |||||
* | add cases for a couple of reviews | Martin Czygan | 2020-12-29 | 11 | -0/+441 |
| | |||||
* | case: doi typo | Martin Czygan | 2020-12-24 | 3 | -0/+111 |
| | |||||
* | add cases | Martin Czygan | 2020-12-24 | 4 | -0/+817 |
| | |||||
* | add article, dataset pair | Martin Czygan | 2020-12-24 | 3 | -0/+1013 |
| |
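
The commit "address es hits.total change in ES7" refers to a breaking change in Elasticsearch 7.0: `hits.total` in a search response is now an object with `value` and `relation` fields rather than a plain integer. The following is a minimal sketch of how client code might tolerate both response shapes. The helper name `total_hits` is hypothetical and is not taken from the fuzzycat code base; the actual change in that commit may look different.

```python
def total_hits(response):
    """Return the total hit count from an Elasticsearch search response body.

    Hypothetical helper for illustration only (not from fuzzycat).
    Before ES7, response["hits"]["total"] is an int; from ES7 on it is
    a dict such as {"value": 10, "relation": "eq"}.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # ES7+ shape: the count lives under "value".
        return total["value"]
    # Pre-ES7 shape: the count is the integer itself.
    return total


# Example usage with hand-written response bodies (not real query results).
old_style = {"hits": {"total": 3, "hits": []}}
new_style = {"hits": {"total": {"value": 5, "relation": "eq"}, "hits": []}}
print(total_hits(old_style), total_hits(new_style))
```

Note that in the ES7 shape a `relation` of `gte` means `value` is only a lower bound, so callers that need an exact count may have to request accurate hit tracking on the query itself.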