Commit message  Author  Age  Files  Lines
* matching: include contribs,files in release entity  Bryan Newbold  2021-10-27  1  -1/+1
|   This makes several downstream applications simpler, like showing PDF
|   links without an additional fatcat API fetch. The 'contrib' entities may
|   be required as part of bibliographic matching (checking the creator names
|   as well as the release-local versions of the name).
|
|   In theory we could add webcaptures,filesets as well, but those are still
|   rare, and occasionally result in very large sub-documents.
* packaging: include py.typed for mypy to detect  Bryan Newbold  2021-10-27  2  -0/+1
|
* deps: pin elasticsearch to less than 7.14  Bryan Newbold  2021-10-27  1  -1/+2
|   This is to avoid 'elasticsearch.exceptions.UnsupportedProductError' errors
|   in newer versions of the elasticsearch client libraries.
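The version ceiling described in the commit above can be illustrated with a small check. This is a minimal sketch and not code from the repository: the helper name `is_supported_es_client` and the `(7, 14)` ceiling tuple are assumptions chosen only to mirror the "less than 7.14" wording, and the input is assumed to carry at least a major and a minor component.

```python
from typing import Tuple


def is_supported_es_client(version: str, ceiling: Tuple[int, int] = (7, 14)) -> bool:
    """Return True if `version` (e.g. "7.13.4") is strictly below `ceiling`.

    Hypothetical helper: it compares only the major and minor components,
    which is enough to express a "less than 7.14" constraint.
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < ceiling
```

In packaging metadata the same constraint would normally be written as a version specifier on the dependency rather than checked at runtime; the function only makes the comparison explicit.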
* start larger refactoring: remove cluster  Martin Czygan  2021-09-24  9  -723/+188
|   background: verifying hundreds of millions of documents turned out to be
|   a bit slow; anecdata: running clustering and verification over 1.8B
|   inputs took over 50h; cf. the Go port (skate) required about 2-4h for
|   those operations. Also: with Go we do not need the extra GNU parallel
|   wrapping.
|
|   In any case, we aim for fuzzycat refactoring to provide:
|
|   * better, more configurable verification and small scale matching
|   * removal of batch clustering code (and improve refcat docs)
|   * a place for a bit more generic, similarity based utils
|
|   The most important piece in fuzzycat is a CSV file containing hand picked
|   test examples for verification - and the code that is able to fulfill
|   that test suite. We want to make this part more robust.
* setup: narrow dependency versions  Martin Czygan  2021-09-21  1  -4/+4
|
* Merge branch 'wip-martin-review-cleanup' into 'master'  Martin Czygan  2021-09-21  13  -20/+272
|\
| |   review notes and some cleanup
| |
| |   See merge request webgroup/fuzzycat!7
| * tests: temporarily disable tests  Martin Czygan  2021-09-21  1  -12/+12
| |   We want to first move to elasticsearch dsl and will reactivate and
| |   extend the tests after refactoring.
| * matching: run an additional es query for fuzzy matching  Martin Czygan  2021-09-21  2  -3/+93
| |
| * reorganize notes  Martin Czygan  2021-09-21  6  -2/+153
| |
| * style: apply formatting  Martin Czygan  2021-09-21  7  -11/+22
|/
* matching: actually return the specified number of results  Martin Czygan  2021-09-15  1  -2/+2
|
* add todo  Martin Czygan  2021-09-14  1  -0/+28
|
* remove pipenv related files  Martin Czygan  2021-09-13  5  -979/+25
|   fuzzycat is mostly a library; the command line tool will switch to a
|   bundled executable (e.g. via shiv) soon; removed pipenv in order to
|   reduce confusion about which setup to use; also, pipenv unfortunately can
|   at times take a while to complete operations
* v0.1.22  Martin Czygan  2021-09-13  1  -1/+1
|
* cluster: adjust tests to jellyfish nysiis implementation  Martin Czygan  2021-09-13  1  -7/+7
|
* update README  Martin Czygan  2021-09-13  1  -4/+7
|
* remove dependency on fuzzy; use jellyfish  Martin Czygan  2021-09-13  4  -304/+286
|
* cleanup makefile  Martin Czygan  2021-09-13  1  -2/+0
|
* update mentions of cgraph to refcat  Bryan Newbold  2021-09-10  2  -2/+2
|
* Merge branch 'master' of git.archive.org:webgroup/fuzzycat  Martin Czygan  2021-07-09  8  -224/+318
|\
| |   * 'master' of git.archive.org:webgroup/fuzzycat:
| |     simplify README for general audience; move some content to notes
| |     sandcrawler slugify: lower-case greek ambiguity (OCR)
| |     DOI clean/normalize helper; and use in verification etc
| |     verify: page count parsing and comparison improvements
| * Merge branch 'bnewbold-readme' into 'master'  Martin Czygan  2021-07-07  2  -210/+245
| |\
| | |   simplify README for general audience; move some content to notes
| | |
| | |   See merge request webgroup/fuzzycat!6
| | * simplify README for general audience; move some content to notes  Bryan Newbold  2021-07-01  2  -210/+245
| | |
| * | Merge branch 'bnewbold-verify-improvements' into 'master'  Martin Czygan  2021-07-02  6  -14/+73
| |\ \
| | |/
| |/|
| | |   verify improvements
| | |
| | |   See merge request webgroup/fuzzycat!4
| | * sandcrawler slugify: lower-case greek ambiguity (OCR)  Bryan Newbold  2021-07-01  1  -2/+13
| | |
| | * DOI clean/normalize helper; and use in verification etc  Bryan Newbold  2021-07-01  5  -6/+35
| | |
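To make the idea of a DOI clean/normalize helper concrete, here is a minimal sketch. It is not the helper added in the commit above: the name `clean_doi_sketch`, the prefix list, and the acceptance rule (a lowercased string that starts with "10." and contains a slash) are all assumptions for illustration.

```python
from typing import Optional

# Common ways a DOI is wrapped in free text (assumed list, not exhaustive).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Return a lowercased bare DOI, or None if `raw` does not look like one."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # Strip at most one known resolver or scheme prefix.
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI has the shape "10.<registrant>/<suffix>".
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi
```

Returning None for anything that does not look like a DOI lets calling code treat "no DOI" and "unusable DOI" the same way when comparing two records.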
| | * verify: page count parsing and comparison improvements  Bryan Newbold  2021-07-01  3  -6/+25
| |/
* | add a few (open) test cases  Martin Czygan  2021-07-09  6  -0/+176
| |
* | notes on matching metrics  Martin Czygan  2021-07-08  1  -0/+16
| |
* | cleanup notes  Martin Czygan  2021-07-08  2  -13/+0
|/
* add test case  Martin Czygan  2021-06-21  4  -0/+1339
|
* v0.1.21  Martin Czygan  2021-06-01  1  -1/+1
|
* Merge branch 'bnewbold-bugfixes' into 'master'  Martin Czygan  2021-06-01  9  -86/+110
|\
| |   fix tests; dynaconf dependency; handle fatcat API release lookup 404
| |
| |   See merge request webgroup/fuzzycat!3
| * lint: remove unused imports  Bryan Newbold  2021-05-31  7  -10/+1
| |
| * rebuild Pipfile.lock, for 'fuzzy' dep  Bryan Newbold  2021-05-31  1  -75/+101
| |   Somehow the 'fuzzy' library was marked in the lockfile as a local,
| |   editable dependency (like fuzzycat itself). Deleted the lockfile and
| |   rebuilt it (pipenv lock) to indicate that it should be an actual pypi
| |   library.
| |
| |   This also bumps all dependency versions, but that seems safe at the
| |   moment.
| * setup.py: express dynaconf dependency  Bryan Newbold  2021-05-31  1  -0/+1
| |
| * matching: handle extid not found case (fatcat API HTTP 400 or 404)  Bryan Newbold  2021-05-31  1  -1/+7
|/
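The "extid not found" handling named in the branch above follows a common pattern: treat HTTP 400 and 404 from a lookup as "no match" and let every other error propagate. The sketch below shows that pattern only. `ApiError` and `lookup_or_none` are hypothetical stand-ins, not the fatcat client's exception class or fuzzycat's function, and `fetch` is any callable supplied by the caller.

```python
from typing import Any, Callable, Optional


class ApiError(Exception):
    """Hypothetical stand-in for an API client exception carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"API error, HTTP {status}")
        self.status = status


def lookup_or_none(
    fetch: Callable[[str, str], Any], ext_type: str, ext_id: str
) -> Optional[Any]:
    """Call `fetch(ext_type, ext_id)`; map HTTP 400/404 to None, re-raise the rest."""
    try:
        return fetch(ext_type, ext_id)
    except ApiError as err:
        if err.status in (400, 404):
            # Malformed or unknown external identifier: report "no match".
            return None
        raise
```

Keeping the status check narrow means server-side failures (for example HTTP 500) still surface as errors instead of being silently read as "not found".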
* add test case  Martin Czygan  2021-05-26  3  -0/+83
|
* add test  Martin Czygan  2021-05-12  3  -0/+603
|
* add test cases  Martin Czygan  2021-05-06  10  -0/+1861
|
* add test case  Martin Czygan  2021-04-20  3  -0/+107
|
* ignore pyproject.toml  Martin Czygan  2021-04-17  1  -0/+3
|
* update lock file  Martin Czygan  2021-04-17  1  -156/+184
|
* add test  Martin Czygan  2021-04-17  3  -0/+1982
|
* v0.1.20  Martin Czygan  2021-04-15  1  -1/+1
|
* address #2  Martin Czygan  2021-04-15  2  -0/+4
|
* Merge branch 'bnewbold-upstreaming' into 'master'  Martin Czygan  2021-04-15  6  -1/+823
|\
| |   refactoring/upstreaming fuzzycat "live" matching helpers
| |
| |   See merge request webgroup/fuzzycat!2
| * main: 'unstructured' CLI demo  Bryan Newbold  2021-04-14  1  -1/+38
| |
| * add 'simple' high-level routines for fuzzy-match-and-verify calls  Bryan Newbold  2021-04-14  2  -0/+316
| |   Some of these are a little redundant, in that calling code could
| |   trivially re-implement them. However, I think these are good starters
| |   for stable external API interfaces, leaving us room to iterate and
| |   refactor lower-level implementations behind the scenes.
| * GROBID API unstructured citation parsing utility code  Bryan Newbold  2021-04-14  3  -1/+258
| |
| * grobid2json helper file  Bryan Newbold  2021-04-13  1  -0/+212
| |   This file has been passed around a couple times and should probably be
| |   published as a pypi.org project at some point.