Commit message | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | turn "match_release_fuzzy" into a class | Martin Czygan | 2021-11-16 | 5 | -84/+942 |
| | | | | | | | | The goal of this refactoring was to make the matching process a bit more configurable by using a class and a cascade of queries. For a limited test set, `FuzzyReleaseMatcher.match` works the same as `match_release_fuzzy`. | ||||
* | use grobid_tei_xml for grobid unstructured lookups | Bryan Newbold | 2021-10-28 | 2 | -251/+23 |
| | |||||
* | matching: include contribs,files in release entity | Bryan Newbold | 2021-10-27 | 1 | -1/+1 |
| | | | | | | | | | | This makes several downstream applications simpler, like showing PDF links without an additional fatcat API fetch. The 'contrib' entities may be required as part of bibliographic matching (checking the creator names as well as the release-local versions of the name). In theory we could add webcaptures and filesets as well, but those are still rare, and occasionally result in very large sub-documents. | ||||
* | packaging: include py.typed for mypy to detect | Bryan Newbold | 2021-10-27 | 1 | -0/+0 |
| | |||||
* | start larger refactoring: remove cluster | Martin Czygan | 2021-09-24 | 6 | -524/+176 |
| | | | | | | | | | | | | | | | | | | Background: verifying hundreds of millions of documents turned out to be a bit slow. Anecdata: running clustering and verification over 1.8B inputs took over 50h, whereas the Go port (skate) required about 2-4h for those operations. Also, with Go we do not need the extra GNU parallel wrapping. In any case, we aim for the fuzzycat refactoring to provide: (1) better, more configurable verification and small-scale matching; (2) removal of batch clustering code (and improve refcat docs); (3) a place for a bit more generic, similarity-based utils. The most important piece in fuzzycat is a CSV file containing hand-picked test examples for verification, and the code that is able to fulfill that test suite. We want to make this part more robust. | ||||
* | matching: run an additional es query for fuzzy matching | Martin Czygan | 2021-09-21 | 1 | -1/+73 |
| | |||||
* | style: apply formatting | Martin Czygan | 2021-09-21 | 5 | -8/+9 |
| | |||||
* | matching: actually return the specified number of results | Martin Czygan | 2021-09-15 | 1 | -2/+2 |
| | |||||
* | v0.1.22 | Martin Czygan | 2021-09-13 | 1 | -1/+1 |
| | |||||
* | remove dependency on fuzzy; use jellyfish | Martin Czygan | 2021-09-13 | 1 | -2/+2 |
| | |||||
* | update mentions of cgraph to refcat | Bryan Newbold | 2021-09-10 | 1 | -1/+1 |
| | |||||
* | sandcrawler slugify: lower-case greek ambiguity (OCR) | Bryan Newbold | 2021-07-01 | 1 | -2/+13 |
| | |||||
* | DOI clean/normalize helper; and use in verification etc | Bryan Newbold | 2021-07-01 | 4 | -5/+21 |
| | |||||
* | verify: page count parsing and comparison improvements | Bryan Newbold | 2021-07-01 | 2 | -4/+18 |
| | |||||
* | v0.1.21 | Martin Czygan | 2021-06-01 | 1 | -1/+1 |
| | |||||
* | lint: remove unused imports | Bryan Newbold | 2021-05-31 | 6 | -9/+1 |
| | |||||
* | matching: handle extid not found case (fatcat API HTTP 400 or 404) | Bryan Newbold | 2021-05-31 | 1 | -1/+7 |
| | |||||
* | v0.1.20 | Martin Czygan | 2021-04-15 | 1 | -1/+1 |
| | |||||
* | main: 'unstructured' CLI demo | Bryan Newbold | 2021-04-14 | 1 | -1/+38 |
| | |||||
* | add 'simple' high-level routines for fuzzy-match-and-verify calls | Bryan Newbold | 2021-04-14 | 1 | -0/+274 |
| | | | | | | Some of these are a little redundant, in that calling code could trivially re-implement them. However, I think these are good starting points for a stable external API, leaving us room to iterate and refactor lower-level implementations behind the scenes. | ||||
* | GROBID API unstructured citation parsing utility code | Bryan Newbold | 2021-04-14 | 2 | -1/+128 |
| | |||||
* | grobid2json helper file | Bryan Newbold | 2021-04-13 | 1 | -0/+212 |
| | | | | | This file has been passed around a couple times and should probably be published as a pypi.org project at some point. | ||||
* | dynaconf: switch to fuzzycat.config import across project | Bryan Newbold | 2021-04-13 | 2 | -1/+5 |
| | | | | This is the recommended way to use dynaconf. | ||||
* | make fmt | Bryan Newbold | 2021-04-13 | 1 | -0/+2 |
| | |||||
* | v0.1.19 | Martin Czygan | 2021-04-12 | 1 | -1/+1 |
| | |||||
* | address es hits.total change in ES7 | Martin Czygan | 2021-04-12 | 2 | -5/+18 |
| | | | | * https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html | ||||
* | v0.1.18 | Martin Czygan | 2021-03-16 | 1 | -1/+1 |
| | |||||
* | matching: a list is required | Martin Czygan | 2021-03-16 | 1 | -1/+1 |
| | |||||
* | v0.1.17 | Martin Czygan | 2021-02-19 | 1 | -1/+1 |
| | |||||
* | v0.1.16 | Martin Czygan | 2021-02-18 | 1 | -1/+1 |
| | |||||
* | v0.1.15 | Martin Czygan | 2021-02-18 | 1 | -1/+1 |
| | |||||
* | v0.1.14 | Martin Czygan | 2021-02-18 | 1 | -1/+1 |
| | |||||
* | workaround for a case found in refs: | Martin Czygan | 2021-02-15 | 1 | -0/+6 |
| | | | | | | | | | | | | | | | | | | https://fatcat.wiki/release/2n7pyugxenb73gope52bn6m2ru; https://fatcat.wiki/release/p4bettvcszgn5d3zls5ogdjk4u. Refs: Niaudet P. Steroid-sensitive idiopathic nephrotic syndrome in children. Pediatric Nephrology. 5th ed. Philadelphia: Lippincott Williams & Wilkins, 2004; pp 543–556. Doc: https://fatcat.wiki/release/lc3d5q62zfa2rjyk2m7nr346nm, T-lymphocyte activation in steroid-sensitive nephrotic syndrome in childhood, by T J Neuhaus, V Shah, R E Callard, T M Barratt | ||||
* | fix list entries | Martin Czygan | 2021-02-13 | 1 | -2/+0 |
| | |||||
* | workaround for dates | Martin Czygan | 2021-02-11 | 1 | -0/+4 |
| | |||||
* | fix name | Martin Czygan | 2021-02-11 | 1 | -3/+3 |
| | |||||
* | update notes | Martin Czygan | 2021-02-11 | 2 | -10/+16 |
| | |||||
* | add a batch verifier for ref groups | Martin Czygan | 2021-02-11 | 2 | -0/+76 |
| | |||||
* | move initialization closer to use | Martin Czygan | 2021-02-02 | 1 | -1/+1 |
| | |||||
* | make fmt | Martin Czygan | 2021-02-02 | 1 | -5/+7 |
| | |||||
* | v0.1.13 | Martin Czygan | 2021-02-02 | 1 | -1/+1 |
| | |||||
* | fix line reading from bytes | Martin Czygan | 2021-02-02 | 1 | -3/+16 |
| | |||||
* | compression fixes and tweaks | Martin Czygan | 2021-02-02 | 1 | -7/+6 |
| | |||||
* | add shellout helper | Martin Czygan | 2021-02-02 | 3 | -2/+59 |
| | |||||
* | cleanup print | Martin Czygan | 2021-02-02 | 1 | -1/+0 |
| | |||||
* | add -C flag for compression | Martin Czygan | 2021-02-02 | 1 | -0/+2 |
| | |||||
* | add compress kwarg to cluster | Martin Czygan | 2021-02-02 | 2 | -14/+66 |
| | | | | Will compress intermediate results with zstd (https://git.io/Jt00y9). | ||||
* | v0.1.12 | Martin Czygan | 2021-01-12 | 1 | -1/+1 |
| | |||||
* | v0.1.11 | Martin Czygan | 2021-01-12 | 1 | -1/+1 |
| | |||||
* | format docs | Martin Czygan | 2021-01-09 | 1 | -1/+2 |
| |