path: root/fuzzycat/matching.py
Commit message | Author | Age | Files | Lines
* matching: include contribs,files in release entity | Bryan Newbold | 2021-10-27 | 1 | -1/+1
  This makes several downstream applications simpler, like showing PDF links without an additional fatcat API fetch. The 'contrib' entities may be required as part of bibliographic matching (checking the creator names as well as the release-local versions of the name).
  In theory we could add webcaptures,filesets as well, but those are still rare, and occasionally result in very large sub-documents.
* start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 1 | -2/+0
  background: verifying hundreds of millions of documents turned out to be a bit slow; anecdata: running clustering and verification over 1.8B inputs took over 50h; cf. the Go port (skate) required about 2-4h for those operations. Also: with Go we do not need the extra GNU parallel wrapping.
  In any case, we aim for fuzzycat refactoring to provide:
    * better, more configurable verification and small scale matching
    * removal of batch clustering code (and improve refcat docs)
    * a place for a bit more generic, similarity based utils
  The most important piece in fuzzycat is a CSV file containing hand picked test examples for verification - and the code that is able to fulfill that test suite. We want to make this part more robust.
* matching: run an additional es query for fuzzy matching | Martin Czygan | 2021-09-21 | 1 | -1/+73
* style: apply formatting | Martin Czygan | 2021-09-21 | 1 | -1/+2
* matching: actually return the specified number of results | Martin Czygan | 2021-09-15 | 1 | -2/+2
* lint: remove unused imports | Bryan Newbold | 2021-05-31 | 1 | -1/+0
* matching: handle extid not found case (fatcat API HTTP 400 or 404) | Bryan Newbold | 2021-05-31 | 1 | -1/+7
* dynaconf: switch to fuzzycat.config import across project | Bryan Newbold | 2021-04-13 | 1 | -1/+1
  This is the recommended way to use dynaconf.
* address es hits.total change in ES7 | Martin Czygan | 2021-04-12 | 1 | -4/+5
    * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html
* matching: a list is required | Martin Czygan | 2021-03-16 | 1 | -1/+1
* [testing] use api for id lookups | Martin Czygan | 2020-12-23 | 1 | -20/+21
* inject configuration | Martin Czygan | 2020-12-23 | 1 | -1/+5
* matching: fix import | Martin Czygan | 2020-12-21 | 1 | -0/+1
* update docs | Martin Czygan | 2020-12-19 | 1 | -0/+2
* update notes | Martin Czygan | 2020-12-17 | 1 | -0/+1
* apply style fixes | Martin Czygan | 2020-12-17 | 1 | -8/+4
* update docs | Martin Czygan | 2020-12-17 | 1 | -3/+4
* pass through api | Martin Czygan | 2020-12-17 | 1 | -9/+13
* add missing function | Martin Czygan | 2020-12-16 | 1 | -1/+59
* docs and release match command | Martin Czygan | 2020-12-16 | 1 | -1/+1
* matching stub | Martin Czygan | 2020-12-15 | 1 | -6/+71
* include matching (stub) | Martin Czygan | 2020-12-15 | 1 | -0/+91
* large overhaul | Martin Czygan | 2020-08-17 | 1 | -147/+0
    * separate all fatcat related code into fatcat submodule
    * more type annotations
    * add verify_serial_name for journal names
* adjust formatting | Martin Czygan | 2020-08-12 | 1 | -1/+2
* fix imports | Martin Czygan | 2020-08-12 | 1 | -1/+1
* improve docs and imports | Martin Czygan | 2020-08-12 | 1 | -9/+8
* try: all matching methods should start with match | Martin Czygan | 2020-08-12 | 1 | -1/+1
* add matching submodule | Martin Czygan | 2020-08-12 | 1 | -0/+147