path: root/fuzzycat
Commit message | Author | Age | Files | Lines
* apply first round of feedback on matching [HEAD, master] | Martin Czygan | 2021-12-21 | 2 | -6/+59
* matching: track_total_hits, use False | Martin Czygan | 2021-12-16 | 1 | -4/+4
    An integer value, although supported according to the docs, yielded a 400 parse error.
* matching: we do not need exact match counts | Martin Czygan | 2021-12-16 | 1 | -4/+4
    up to 100 or even will be ok; see also: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/search-your-data.html#track-total-hits
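The two commits above concern the Elasticsearch `track_total_hits` search option. A minimal sketch of a search body that turns exact hit counting off might look as follows; the helper name `build_search_body` and the queried field `title` are assumptions for illustration and are not taken from the fuzzycat source.

```python
import json


def build_search_body(title, size=10):
    """Build a plain Elasticsearch search body (hypothetical helper)."""
    return {
        "query": {"match": {"title": title}},
        "size": size,
        # False disables hit counting entirely; the commit above reports
        # that an integer value here led to a 400 parse error.
        "track_total_hits": False,
    }


body = build_search_body("digital libraries")
print(json.dumps(body, indent=2))
```

With counting disabled, the response carries no `hits.total`, so calling code should rely on the returned hits only.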
* matching: add hdl, remove mag id | Martin Czygan | 2021-12-16 | 1 | -1/+1
* matching: cleanup and documentation | Martin Czygan | 2021-12-07 | 1 | -47/+29
* matching: update docs | Martin Czygan | 2021-12-07 | 1 | -9/+8
* v0.1.23 | Martin Czygan | 2021-12-06 | 1 | -1/+1
* complete FuzzyReleaseMatcher refactoring | Martin Czygan | 2021-12-06 | 1 | -278/+201
    We keep the name, since the API - "matcher.match(release)" - is the same;
    simplified queries; at most one query is performed against elasticsearch;
    parallel release retrieval from the API; optional support for release year
    windows.

    Test cases are expressed in yaml and will be auto-loaded from the specified
    directory; tests work against the current search endpoint, which means the
    actual output may change on index updates; for the moment, we think this
    setup is relatively simple and not too unstable.

        about: title contrib, partial name
        input: >
          { "contribs": [ { "raw_name": "Adams" } ], "title": "digital libraries", "ext_ids": {} }
        release_year_padding: 1
        expected:
          - 7rmvqtrb2jdyhcxxodihzzcugy
          - a2u6ougtsjcbvczou6sazsulcm
          - dy45vilej5diros6zmax46nm4e
          - exuwhhayird4fdjmmsiqpponlq
          - gqrj7jikezgcfpjfazhpf4e7c4
          - mkmqt3453relbpuyktnmsg6hjq
          - t2g5sl3dgzchtnq7dynxyzje44
          - t4tvenhrvzamraxrvvxivxmvga
          - wd3oeoi3bffknfbg2ymleqc4ja
          - y63a6dhrfnb7bltlxfynydbojy
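The "release year window" mentioned in the commit above, together with the `release_year_padding` key of the test case, could be sketched as an Elasticsearch range filter. This is only an illustration under assumptions: the function name `year_window_filter` and the indexed field name `release_year` are hypothetical and not confirmed by the commit.

```python
from typing import Optional


def year_window_filter(release_year: Optional[int], padding: Optional[int]) -> Optional[dict]:
    """Return a range filter covering release_year +/- padding.

    Returns None when either value is missing, so that no year
    restriction is applied (hypothetical behavior).
    """
    if release_year is None or padding is None:
        return None
    return {
        "range": {
            # "release_year" is an assumed field name in the search index
            "release_year": {
                "gte": release_year - padding,
                "lte": release_year + padding,
            }
        }
    }


window = year_window_filter(2010, 1)
no_window = year_window_filter(None, 1)
```

A padding of 1 would accept releases from the year before through the year after the given release year.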
* complete migration away from match_release_fuzzy | Martin Czygan | 2021-11-16 | 3 | -166/+6
    Instead, use `FuzzyReleaseMatcher.match`, which has approximately the same behavior.
* use elasticsearch <7.14 search args | Martin Czygan | 2021-11-16 | 1 | -11/+47
* turn "match_release_fuzzy" into a class | Martin Czygan | 2021-11-16 | 5 | -84/+942
    Goal of this refactoring was to make the matching process a bit more
    configurable by using a class and a cascade of queries.

    For a limited test set: `FuzzyReleaseMatcher.match` works the same as
    `match_release_fuzzy`.
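The "cascade of queries" idea from the commit above can be illustrated generically: try query strategies in order and stop at the first one that returns results. This is a sketch of the general pattern only, not the actual `FuzzyReleaseMatcher` code; `run_cascade` and the stub strategies are hypothetical.

```python
from typing import Callable, Iterable, List


def run_cascade(strategies: Iterable[Callable[[], List[str]]]) -> List[str]:
    """Run query strategies in order; return the first non-empty result."""
    for strategy in strategies:
        hits = strategy()
        if hits:
            return hits
    return []


# Stub strategies standing in for real search queries (made-up identifiers).
def by_ext_id() -> List[str]:
    return []  # e.g. no external identifier available


def by_exact_title() -> List[str]:
    return ["release-a", "release-b"]


def by_fuzzy_title() -> List[str]:
    return ["release-c"]


result = run_cascade([by_ext_id, by_exact_title, by_fuzzy_title])
```

Ordering the strategies from most to least precise means the cheaper, more reliable lookups get the first chance to answer.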
* use grobid_tei_xml for grobid unstructured lookups | Bryan Newbold | 2021-10-28 | 2 | -251/+23
* matching: include contribs,files in release entity | Bryan Newbold | 2021-10-27 | 1 | -1/+1
    This makes several downstream applications simpler, like showing PDF links
    without an additional fatcat API fetch.

    The 'contrib' entities may be required as part of bibliographic matching
    (checking the creator names as well as the release-local versions of the
    name).

    In theory we could add webcaptures,filesets as well, but those are still
    rare, and occasionally result in very large sub-documents.
* packaging: include py.typed for mypy to detect | Bryan Newbold | 2021-10-27 | 1 | -0/+0
* start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 6 | -524/+176
    background: verifying hundreds of millions of documents turned out to be a
    bit slow; anecdata: running clustering and verification over 1.8B inputs
    took over 50h; cf. the Go port (skate) required about 2-4h for those
    operations. Also: with Go we do not need the extra GNU parallel wrapping.

    In any case, we aim for fuzzycat refactoring to provide:

    * better, more configurable verification and small scale matching
    * removal of batch clustering code (and improve refcat docs)
    * a place for a bit more generic, similarity based utils

    The most important piece in fuzzycat is a CSV file containing hand picked
    test examples for verification - and the code that is able to fulfill that
    test suite. We want to make this part more robust.
* matching: run an additional es query for fuzzy matching | Martin Czygan | 2021-09-21 | 1 | -1/+73
* style: apply formatting | Martin Czygan | 2021-09-21 | 5 | -8/+9
* matching: actually return the specified number of results | Martin Czygan | 2021-09-15 | 1 | -2/+2
* v0.1.22 | Martin Czygan | 2021-09-13 | 1 | -1/+1
* remove dependency on fuzzy; use jellyfish | Martin Czygan | 2021-09-13 | 1 | -2/+2
* update mentions of cgraph to refcat | Bryan Newbold | 2021-09-10 | 1 | -1/+1
* sandcrawler slugify: lower-case greek ambiguity (OCR) | Bryan Newbold | 2021-07-01 | 1 | -2/+13
* DOI clean/normalize helper; and use in verification etc | Bryan Newbold | 2021-07-01 | 4 | -5/+21
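As an illustration of what a DOI clean/normalize helper can do, here is a minimal sketch. It is not the helper added in the commit above; the function name `clean_doi`, the list of stripped prefixes, and the validity check are assumptions.

```python
from typing import Optional

# Common ways a DOI is written with a resolver or scheme prefix (assumed list).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def clean_doi(raw: Optional[str]) -> Optional[str]:
    """Lower-case a DOI, strip a known prefix, and reject obvious non-DOIs."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI starts with the "10." directory indicator and has a "/" separator.
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi
```

Normalizing both sides before comparison lets verification treat `https://doi.org/10.1000/XYZ` and `10.1000/xyz` as the same identifier.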
* verify: page count parsing and comparison improvements | Bryan Newbold | 2021-07-01 | 2 | -4/+18
* v0.1.21 | Martin Czygan | 2021-06-01 | 1 | -1/+1
* lint: remove unused imports | Bryan Newbold | 2021-05-31 | 6 | -9/+1
* matching: handle extid not found case (fatcat API HTTP 400 or 404) | Bryan Newbold | 2021-05-31 | 1 | -1/+7
* v0.1.20 | Martin Czygan | 2021-04-15 | 1 | -1/+1
* main: 'unstructured' CLI demo | Bryan Newbold | 2021-04-14 | 1 | -1/+38
* add 'simple' high-level routines for fuzzy-match-and-verify calls | Bryan Newbold | 2021-04-14 | 1 | -0/+274
    Some of these are a little redundant, in that calling code could trivially
    re-implement. However, I think these are good starters for stable external
    API interfaces, leaving us room to iterate and refactor lower-level
    implementations behind the scenes.
* GROBID API unstructured citation parsing utility code | Bryan Newbold | 2021-04-14 | 2 | -1/+128
* grobid2json helper file | Bryan Newbold | 2021-04-13 | 1 | -0/+212
    This file has been passed around a couple times and should probably be
    published as a pypi.org project at some point.
* dynaconf: switch to fuzzycat.config import across project | Bryan Newbold | 2021-04-13 | 2 | -1/+5
    This is the recommended way to use dynaconf.
* make fmt | Bryan Newbold | 2021-04-13 | 1 | -0/+2
* v0.1.19 | Martin Czygan | 2021-04-12 | 1 | -1/+1
* address es hits.total change in ES7 | Martin Czygan | 2021-04-12 | 2 | -5/+18
    * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html
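For context on the commit above: since Elasticsearch 7.0, `hits.total` in a search response is an object with `value` and `relation` keys rather than a plain integer. A small sketch of code that tolerates both shapes follows; the helper name `total_hits` is hypothetical and not taken from fuzzycat.

```python
from typing import Optional


def total_hits(response: dict) -> Optional[int]:
    """Extract the hit count from an ES 6.x or ES 7.x style response.

    Returns None when the response carries no total at all, for example
    when the request disabled hit counting.
    """
    total = response.get("hits", {}).get("total")
    if isinstance(total, dict):
        # ES 7.x: {"value": 42, "relation": "eq"}
        return total.get("value")
    # ES 6.x and earlier: a plain integer (or None if absent)
    return total


es6_style = {"hits": {"total": 42, "hits": []}}
es7_style = {"hits": {"total": {"value": 42, "relation": "eq"}, "hits": []}}
```

Note that in the ES 7.x shape a `relation` of `gte` means the value is only a lower bound.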
* v0.1.18 | Martin Czygan | 2021-03-16 | 1 | -1/+1
* matching: a list is required | Martin Czygan | 2021-03-16 | 1 | -1/+1
* v0.1.17 | Martin Czygan | 2021-02-19 | 1 | -1/+1
* v0.1.16 | Martin Czygan | 2021-02-18 | 1 | -1/+1
* v0.1.15 | Martin Czygan | 2021-02-18 | 1 | -1/+1
* v0.1.14 | Martin Czygan | 2021-02-18 | 1 | -1/+1
* workaround for a case found in refs: | Martin Czygan | 2021-02-15 | 1 | -0/+6
    * https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru
    * https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u

    Refs: Niaudet P. Steroid-sensitive idiopathic nephrotic syndrome in
    children. Pediatric Nephrology. 5th ed. Philadelphia: Lippincott Williams &
    Wilkins, 2004; pp 543–556.

    Doc:

    * https://fatcat.wiki/release/lc3d5q62zfa2rjyk2m7nr346nm, T-lymphocyte
      activation in steroid-sensitive nephrotic syndrome in childhood, by
      T J Neuhaus, V Shah, R E Callard, T M Barratt
* fix list entries | Martin Czygan | 2021-02-13 | 1 | -2/+0
* workaround for dates | Martin Czygan | 2021-02-11 | 1 | -0/+4
* fix name | Martin Czygan | 2021-02-11 | 1 | -3/+3
* update notes | Martin Czygan | 2021-02-11 | 2 | -10/+16
* add a batch verifier for ref groups | Martin Czygan | 2021-02-11 | 2 | -0/+76
* move initialization closer to use | Martin Czygan | 2021-02-02 | 1 | -1/+1
* make fmt | Martin Czygan | 2021-02-02 | 1 | -5/+7