path: root/fuzzycat/utils.py
Commit message | Author | Age | Files | Lines
* apply first round of feedback on matching [HEAD, master] | Martin Czygan | 2021-12-21 | 1 | -0/+7
* start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 1 | -5/+10

    Background: verifying hundreds of millions of documents turned out to be a bit slow. Anecdata: running clustering and verification over 1.8B inputs took over 50h; for comparison, the Go port (skate) required about 2-4h for those operations. Also, with Go we do not need the extra GNU parallel wrapping.

    In any case, we aim for the fuzzycat refactoring to provide:

    * better, more configurable verification and small scale matching
    * removal of batch clustering code (and improve refcat docs)
    * a place for a bit more generic, similarity based utils

    The most important piece in fuzzycat is a CSV file containing hand picked test examples for verification, and the code that is able to fulfill that test suite. We want to make this part more robust.

* style: apply formatting | Martin Czygan | 2021-09-21 | 1 | -0/+2
* DOI clean/normalize helper; and use in verification etc | Bryan Newbold | 2021-07-01 | 1 | -0/+14
* verify: page count parsing and comparison improvements | Bryan Newbold | 2021-07-01 | 1 | -3/+15
* lint: remove unused imports | Bryan Newbold | 2021-05-31 | 1 | -1/+0
* fix imports and formatting | Martin Czygan | 2021-04-14 | 1 | -0/+2
* address es hits.total change in ES7 | Martin Czygan | 2021-04-12 | 1 | -1/+13

    * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html

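    For context on the ES7 commit above: as of Elasticsearch 7.0, `hits.total` in a search response is an object with `value` and `relation` fields instead of a plain integer. A minimal sketch of a helper that tolerates both shapes follows; the function name `es_total_hits` is hypothetical and not taken from fuzzycat itself.

```python
def es_total_hits(response: dict) -> int:
    """Return the total hit count from an ES 6 or ES 7+ search response.

    Hypothetical helper for illustration; not fuzzycat's actual code.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # ES 7+ shape: {"value": 42, "relation": "eq"}
        return int(total["value"])
    # ES 6 and earlier: a plain integer
    return int(total)
```

    Note that in ES 7+ the `relation` field may be `gte`, in which case `value` is only a lower bound on the number of hits.
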
* move initialization closer to use | Martin Czygan | 2021-02-02 | 1 | -1/+1
* fix line reading from bytes | Martin Czygan | 2021-02-02 | 1 | -3/+16
* add shellout helper | Martin Czygan | 2021-02-02 | 1 | -0/+54
* add compress kwarg to cluster | Martin Czygan | 2021-02-02 | 1 | -0/+22

    Will compress intermediate results with zstd (https://git.io/Jt00y9).

* add cases | Martin Czygan | 2021-01-04 | 1 | -0/+7
* fix cmdline tool | Martin Czygan | 2020-12-15 | 1 | -10/+8
* fix verification | Martin Czygan | 2020-12-15 | 1 | -5/+6
* single item verification | Martin Czygan | 2020-12-15 | 1 | -0/+34
* verify: move out some code to utils | Martin Czygan | 2020-12-14 | 1 | -2/+22
* update docs | Martin Czygan | 2020-12-12 | 1 | -1/+6
* fix imports | Martin Czygan | 2020-12-12 | 1 | -1/+1
* add generic doi version case | Martin Czygan | 2020-12-11 | 1 | -0/+14
* verify: bsi undated | Martin Czygan | 2020-12-01 | 1 | -0/+9
* add another case | Martin Czygan | 2020-12-01 | 1 | -1/+1
* figshare fix | Martin Czygan | 2020-11-26 | 1 | -2/+2
* add another test case | Martin Czygan | 2020-11-25 | 1 | -2/+2
* move helpers to utils | Martin Czygan | 2020-11-25 | 1 | -0/+23
* move enums into common | Martin Czygan | 2020-11-25 | 1 | -2/+2
* apply formatting | Martin Czygan | 2020-11-25 | 1 | -2/+2
* extend tests | Martin Czygan | 2020-11-25 | 1 | -0/+43
* extend test coverage | Martin Czygan | 2020-11-25 | 1 | -0/+26
* cleanup | Martin Czygan | 2020-10-21 | 1 | -249/+0
* large overhaul | Martin Czygan | 2020-08-17 | 1 | -2/+2

    * separate all fatcat related code into fatcat submodule
    * more type annotations
    * add verify_serial_name for journal names

* cleanup handling: add parameter | Martin Czygan | 2020-08-15 | 1 | -0/+3

    allow string cleanup to be called directly

* issn: generate a name to issn mapping | Martin Czygan | 2020-08-12 | 1 | -0/+16

    This makes it possible to flag potentially ambiguous titles. Maybe suggest a minimal length. Ultimately, there are only about 2M journal titles. If an arbitrary string must match a journal title (not a generic container title), then we can use a combination of direct lookup plus some extra processing based on this dataset.

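    The name-to-ISSN idea in the commit message above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the lowercase/strip normalization, and the sample records (including the ISSN-like strings) are all hypothetical and not taken from fuzzycat or any real dataset.

```python
from collections import defaultdict


def build_name_to_issn(records):
    """Map a normalized journal name to the set of ISSNs seen for it.

    `records` is an iterable of (name, issn) pairs; hypothetical input shape.
    """
    mapping = defaultdict(set)
    for name, issn in records:
        mapping[name.strip().lower()].add(issn)
    return mapping


def is_ambiguous(mapping, name):
    """A name is ambiguous if direct lookup yields more than one ISSN."""
    return len(mapping.get(name.strip().lower(), ())) > 1


# Made-up sample data, for illustration only.
records = [
    ("Journal of Examples", "0000-0001"),
    ("journal of examples ", "0000-0002"),
    ("Unique Letters", "0000-0003"),
]
mapping = build_name_to_issn(records)
```

    With this sample data, `is_ambiguous(mapping, "Journal of Examples")` is true because two distinct ISSNs share the normalized name, while `"Unique Letters"` resolves to a single ISSN.
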
* switch to yapf | Martin Czygan | 2020-08-12 | 1 | -3/+0
* utils: fix imports | Martin Czygan | 2020-08-12 | 1 | -1/+1
* import utility functions | Martin Czygan | 2020-08-12 | 1 | -0/+151
* add basic str utils | Martin Czygan | 2020-08-12 | 1 | -0/+82