Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | use elasticsearch <7.14 search args | Martin Czygan | 2021-11-16 | 1 | -11/+47 |
| | |||||
* | setup: add missing pyyaml dependency | Martin Czygan | 2021-11-16 | 1 | -0/+1 |
| | |||||
* | setup: add thefuzz dependency | Martin Czygan | 2021-11-16 | 1 | -1/+2 |
| | |||||
* | turn "match_release_fuzzy" into a class | Martin Czygan | 2021-11-16 | 23 | -111/+1371 |
| | | | | | | | | Goal of this refactoring was to make the matching process a bit more configurable by using a class and a cascade of queries. For a limited test set, `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`. | ||||
* | Merge branch 'bnewbold-grobid-tei-xml' into 'master' | Martin Czygan | 2021-11-04 | 4 | -277/+56 |
|\ | | | | | | | | | use grobid_tei_xml for grobid unstructured lookups See merge request webgroup/fuzzycat!9 | ||||
| * | use grobid_tei_xml for grobid unstructured lookups | Bryan Newbold | 2021-10-28 | 4 | -277/+56 |
|/ | |||||
* | Merge branch 'bnewbold-tweaks' into 'master' | Martin Czygan | 2021-10-28 | 3 | -3/+5 |
|\ | | | | | | | | | tweaks to deps and packaging; add files,contribs in live match release lookups See merge request webgroup/fuzzycat!8 | ||||
| * | bump fatcat-openapi-client version to 0.4.0 | Bryan Newbold | 2021-10-27 | 1 | -1/+1 |
| | | | | | | | | | | | | There isn't any new feature required in the new version of the client library, but it feels like we should aggressively update everywhere when possible. | ||||
| * | matching: include contribs,files in release entity | Bryan Newbold | 2021-10-27 | 1 | -1/+1 |
| | | | | | | | | | | | | | | | | | | | | This makes several downstream applications simpler, like showing PDF links without an additional fatcat API fetch. The 'contrib' entities may be required as part of bibliographic matching (checking the creator names as well as the release-local versions of the name). In theory we could add webcaptures,filesets as well, but those are still rare, and occasionally result in very large sub-documents. | ||||
| * | packaging: include py.typed for mypy to detect | Bryan Newbold | 2021-10-27 | 2 | -0/+1 |
| | | |||||
| * | deps: pin elasticsearch to less than 7.14 | Bryan Newbold | 2021-10-27 | 1 | -1/+2 |
|/ | | | | | This is to avoid 'elasticsearch.exceptions.UnsupportedProductError' errors in newer versions of the elasticsearch client libraries. | ||||
* | start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 9 | -723/+188 |
| | | | | | | | | | | | | | | | | | | background: verifying hundreds of millions of documents turned out to be a bit slow; anecdata: running clustering and verification over 1.8B inputs took over 50h; by comparison, the Go port (skate) required about 2-4h for those operations. Also: with Go we do not need the extra GNU parallel wrapping. In any case, we aim for the fuzzycat refactoring to provide: * better, more configurable verification and small scale matching * removal of batch clustering code (and improved refcat docs) * a place for somewhat more generic, similarity-based utils The most important piece in fuzzycat is a CSV file containing hand-picked test examples for verification, and the code that is able to fulfill that test suite. We want to make this part more robust. | ||||
* | setup: narrow dependency versions | Martin Czygan | 2021-09-21 | 1 | -4/+4 |
| | |||||
* | Merge branch 'wip-martin-review-cleanup' into 'master' | Martin Czygan | 2021-09-21 | 13 | -20/+272 |
|\ | | | | | | | | | review notes and some cleanup See merge request webgroup/fuzzycat!7 | ||||
| * | tests: temporarily disable tests | Martin Czygan | 2021-09-21 | 1 | -12/+12 |
| | | | | | | | | | We want to first move to elasticsearch dsl and will reactivate and extend the tests after refactoring. | ||||
| * | matching: run an additional es query for fuzzy matching | Martin Czygan | 2021-09-21 | 2 | -3/+93 |
| | | |||||
| * | reorganize notes | Martin Czygan | 2021-09-21 | 6 | -2/+153 |
| | | |||||
| * | style: apply formatting | Martin Czygan | 2021-09-21 | 7 | -11/+22 |
|/ | |||||
* | matching: actually return the specified number of results | Martin Czygan | 2021-09-15 | 1 | -2/+2 |
| | |||||
* | add todo | Martin Czygan | 2021-09-14 | 1 | -0/+28 |
| | |||||
* | remove pipenv related files | Martin Czygan | 2021-09-13 | 5 | -979/+25 |
| | | | | | | | | fuzzycat is mostly a library; the command line tool will switch to a bundled executable (e.g. via shiv) soon; removed pipenv to reduce confusion about which setup to use; also, pipenv unfortunately can at times take a while to complete operations | ||||
* | v0.1.22 | Martin Czygan | 2021-09-13 | 1 | -1/+1 |
| | |||||
* | cluster: adjust tests to jellyfish nysiis implementation | Martin Czygan | 2021-09-13 | 1 | -7/+7 |
| | |||||
* | update README | Martin Czygan | 2021-09-13 | 1 | -4/+7 |
| | |||||
* | remove dependency on fuzzy; use jellyfish | Martin Czygan | 2021-09-13 | 4 | -304/+286 |
| | |||||
* | cleanup makefile | Martin Czygan | 2021-09-13 | 1 | -2/+0 |
| | |||||
* | update mentions of cgraph to refcat | Bryan Newbold | 2021-09-10 | 2 | -2/+2 |
| | |||||
* | Merge branch 'master' of git.archive.org:webgroup/fuzzycat | Martin Czygan | 2021-07-09 | 8 | -224/+318 |
|\ | | | | | | | | | | | | | | | * 'master' of git.archive.org:webgroup/fuzzycat: simplify README for general audience; move some content to notes sandcrawler slugify: lower-case greek ambiguity (OCR) DOI clean/normalize helper; and use in verification etc verify: page count parsing and comparison improvements | ||||
| * | Merge branch 'bnewbold-readme' into 'master' | Martin Czygan | 2021-07-07 | 2 | -210/+245 |
| |\ | | | | | | | | | | | | | simplify README for general audience; move some content to notes See merge request webgroup/fuzzycat!6 | ||||
| | * | simplify README for general audience; move some content to notes | Bryan Newbold | 2021-07-01 | 2 | -210/+245 |
| | | | |||||
| * | | Merge branch 'bnewbold-verify-improvements' into 'master' | Martin Czygan | 2021-07-02 | 6 | -14/+73 |
| |\ \ | | |/ | |/| | | | | | | | verify improvements See merge request webgroup/fuzzycat!4 | ||||
| | * | sandcrawler slugify: lower-case greek ambiguity (OCR) | Bryan Newbold | 2021-07-01 | 1 | -2/+13 |
| | | | |||||
| | * | DOI clean/normalize helper; and use in verification etc | Bryan Newbold | 2021-07-01 | 5 | -6/+35 |
| | | | |||||
| | * | verify: page count parsing and comparison improvements | Bryan Newbold | 2021-07-01 | 3 | -6/+25 |
| |/ | |||||
* | | add a few (open) test cases | Martin Czygan | 2021-07-09 | 6 | -0/+176 |
| | | |||||
* | | notes on matching metrics | Martin Czygan | 2021-07-08 | 1 | -0/+16 |
| | | |||||
* | | cleanup notes | Martin Czygan | 2021-07-08 | 2 | -13/+0 |
|/ | |||||
* | add test case | Martin Czygan | 2021-06-21 | 4 | -0/+1339 |
| | |||||
* | v0.1.21 | Martin Czygan | 2021-06-01 | 1 | -1/+1 |
| | |||||
* | Merge branch 'bnewbold-bugfixes' into 'master' | Martin Czygan | 2021-06-01 | 9 | -86/+110 |
|\ | | | | | | | | | fix tests; dynaconf dependency; handle fatcat API release lookup 404 See merge request webgroup/fuzzycat!3 | ||||
| * | lint: remove unused imports | Bryan Newbold | 2021-05-31 | 7 | -10/+1 |
| | | |||||
| * | rebuild Pipfile.lock, for 'fuzzy' dep | Bryan Newbold | 2021-05-31 | 1 | -75/+101 |
| | | | | | | | | | | | | | | | | | | | | Somehow the 'fuzzy' library was marked in the lockfile as a local, editable dependency (like fuzzycat itself). Deleted the lockfile and rebuilt it (pipenv lock) to indicate that it should be an actual pypi library. This also bumps all dependency versions, but that seems safe at the moment. | ||||
| * | setup.py: express dynaconf dependency | Bryan Newbold | 2021-05-31 | 1 | -0/+1 |
| | | |||||
| * | matching: handle extid not found case (fatcat API HTTP 400 or 404) | Bryan Newbold | 2021-05-31 | 1 | -1/+7 |
|/ | |||||
* | add test case | Martin Czygan | 2021-05-26 | 3 | -0/+83 |
| | |||||
* | add test | Martin Czygan | 2021-05-12 | 3 | -0/+603 |
| | |||||
* | add test cases | Martin Czygan | 2021-05-06 | 10 | -0/+1861 |
| | |||||
* | add test case | Martin Czygan | 2021-04-20 | 3 | -0/+107 |
| | |||||
* | ignore pyproject.toml | Martin Czygan | 2021-04-17 | 1 | -0/+3 |
| | |||||
* | update lock file | Martin Czygan | 2021-04-17 | 1 | -156/+184 |
| |
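
The 2021-11-16 refactoring commit above describes turning `match_release_fuzzy` into a class that runs "a cascade of queries". The sketch below illustrates that general pattern only. It assumes a toy in-memory catalog instead of Elasticsearch, and every name in it (`CascadeMatcher`, `CATALOG`, the three query functions) is hypothetical and not taken from the fuzzycat source.

```python
from typing import Callable, Iterable, List, Optional

# A query takes a title and returns candidate titles (hypothetical signature).
Query = Callable[[str], List[str]]

# Toy stand-in for a search index; the real project queries Elasticsearch.
CATALOG = ["Fuzzy Matching of Releases", "A Survey of Citation Graphs"]


def exact_title(title: str) -> List[str]:
    """Strictest step: exact string equality."""
    return [c for c in CATALOG if c == title]


def case_insensitive_title(title: str) -> List[str]:
    """Looser step: equality after lower-casing."""
    return [c for c in CATALOG if c.lower() == title.lower()]


def token_overlap(title: str) -> List[str]:
    """Loosest step: at least half of the query tokens appear in the candidate."""
    wanted = set(title.lower().split())
    if not wanted:
        return []
    hits = []
    for candidate in CATALOG:
        have = set(candidate.lower().split())
        if len(wanted & have) / len(wanted) >= 0.5:
            hits.append(candidate)
    return hits


class CascadeMatcher:
    """Run queries from strict to loose and stop at the first one with results."""

    def __init__(self, queries: Iterable[Query], size: int = 10):
        self.queries = list(queries)
        self.size = size

    def match(self, title: Optional[str]) -> List[str]:
        if not title or not title.strip():
            return []
        for query in self.queries:
            results = query(title.strip())
            if results:
                # Return at most the requested number of results.
                return results[: self.size]
        return []


matcher = CascadeMatcher([exact_title, case_insensitive_title, token_overlap], size=5)
print(matcher.match("fuzzy matching of releases"))
print(matcher.match("citation graphs survey"))
```

Ordering the queries from strict to loose means cheap, high-precision lookups get the first chance, and the fuzzier fallbacks only run when nothing stricter matched. The list of queries and the result size are constructor arguments, which is one way such a class could be made configurable.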