| Commit message (Collapse) | Author | Age | Files | Lines |
|
integer, although supported according to the docs, yielded a 400 parse error
|
up to 100 or even will be ok; see also: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/search-your-data.html#track-total-hits
|
We keep the name, since the API, "matcher.match(release)", is the
same. Changes: simplified queries; at most one query is performed against
elasticsearch; parallel release retrieval from the API; optional support
for release year windows.
Test cases are expressed in YAML and are auto-loaded from the
specified directory. Tests run against the current search endpoint,
which means the actual output may change on index updates; for the
moment, we think this setup is relatively simple and not too unstable.
about: title contrib, partial name
input: >
  {
    "contribs": [
      {
        "raw_name": "Adams"
      }
    ],
    "title": "digital libraries",
    "ext_ids": {}
  }
release_year_padding: 1
expected:
  - 7rmvqtrb2jdyhcxxodihzzcugy
  - a2u6ougtsjcbvczou6sazsulcm
  - dy45vilej5diros6zmax46nm4e
  - exuwhhayird4fdjmmsiqpponlq
  - gqrj7jikezgcfpjfazhpf4e7c4
  - mkmqt3453relbpuyktnmsg6hjq
  - t2g5sl3dgzchtnq7dynxyzje44
  - t4tvenhrvzamraxrvvxivxmvga
  - wd3oeoi3bffknfbg2ymleqc4ja
  - y63a6dhrfnb7bltlxfynydbojy
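The "release year windows" and the `release_year_padding` key above suggest a symmetric window around a release year. A minimal sketch of how such a window could be turned into an Elasticsearch range clause; the helper name `year_window_filter` and the field name `release_year` are assumptions for illustration, not taken from the fuzzycat code:

```python
from typing import Optional


def year_window_filter(release_year: Optional[int], padding: int = 1) -> Optional[dict]:
    """Return a range clause covering [year - padding, year + padding].

    Returns None when the release has no year, so the caller can skip the
    filter. The field name "release_year" is an assumption for this sketch.
    """
    if release_year is None:
        return None
    if padding < 0:
        raise ValueError("padding must be non-negative")
    return {
        "range": {
            "release_year": {
                "gte": release_year - padding,
                "lte": release_year + padding,
            }
        }
    }


clause = year_window_filter(2010, padding=1)
```

With a padding of 1, a release dated 2010 would be compared against candidates from 2009 through 2011.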
|
Instead, use `FuzzyReleaseMatcher.match`, which has approximately the
same behavior.
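If the old function were kept around as a thin shim, the hand-over could look roughly like the sketch below. The `FuzzyReleaseMatcher` class here is a stand-in with a placeholder body; only the names `match_release_fuzzy` and `FuzzyReleaseMatcher.match` come from the log, and whether the real code emits a warning is an assumption:

```python
import warnings


class FuzzyReleaseMatcher:
    """Stand-in for the real class; only the method name comes from the log."""

    def match(self, release):
        # Placeholder result; the real matcher queries a search index.
        return [release]


def match_release_fuzzy(release):
    """Deprecated entry point that forwards to FuzzyReleaseMatcher.match."""
    warnings.warn(
        "match_release_fuzzy is deprecated; use FuzzyReleaseMatcher.match",
        DeprecationWarning,
        stacklevel=2,
    )
    return FuzzyReleaseMatcher().match(release)
```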
|
turn "match_release_fuzzy" into a class
See merge request webgroup/fuzzycat!10
|
Goal of this refactoring was to make the matching process a bit more
configurable by using a class and a cascade of queries.
For a limited test set, `FuzzyReleaseMatcher.match` works the same as
`match_release_fuzzy`.
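A "cascade of queries" can be read as: try the most specific query first and fall back to broader ones until something matches. A minimal sketch of that idea against a toy in-memory index; the function names, the query order, and the sample documents are illustrative assumptions, not the actual fuzzycat implementation:

```python
from typing import Callable, Iterable, List

Query = Callable[[dict], List[dict]]


def match_with_cascade(release: dict, queries: Iterable[Query]) -> List[dict]:
    """Run queries in order and return the first non-empty result."""
    for query in queries:
        hits = query(release)
        if hits:
            return hits
    return []


# Toy in-memory "index" standing in for a search endpoint.
INDEX = [
    {"title": "Digital Libraries", "doi": "10.1000/example"},
    {"title": "digital libraries revisited", "doi": None},
]


def by_doi(release):
    """Most specific step: exact DOI equality."""
    doi = release.get("doi")
    return [doc for doc in INDEX if doi and doc["doi"] == doi]


def by_exact_title(release):
    """Fallback step: case-insensitive title equality."""
    title = release.get("title", "").lower()
    return [doc for doc in INDEX if title and doc["title"].lower() == title]


result = match_with_cascade({"title": "digital libraries"}, [by_doi, by_exact_title])
```

Making the list of query steps a parameter is what makes the process configurable: callers can reorder, drop, or add steps without touching the loop.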
|
use grobid_tei_xml for grobid unstructured lookups
See merge request webgroup/fuzzycat!9
|
tweaks to deps and packaging; add files,contribs in live match release lookups
See merge request webgroup/fuzzycat!8
|
There isn't any new feature required in the new version of the client
library, but it feels like we should aggressively update everywhere when
possible.
|
This makes several downstream applications simpler, like showing PDF
links without an additional fatcat API fetch. The 'contrib' entities may
be required as part of bibliographic matching (checking the creator
names as well as the release-local versions of the name).
In theory we could add webcaptures and filesets as well, but those are
still rare, and occasionally result in very large sub-documents.
|
This is to avoid 'elasticsearch.exceptions.UnsupportedProductError'
errors in newer versions of the elasticsearch client libraries.
|
Background: verifying hundreds of millions of documents turned out to be
a bit slow. Anecdata: running clustering and verification over 1.8B
inputs took over 50h; by comparison, the Go port (skate) required about
2-4h for those operations. Also, with Go we do not need the extra GNU
parallel wrapping.
In any case, we aim for fuzzycat refactoring to provide:
* better, more configurable verification and small scale matching
* removal of batch clustering code (and improve refcat docs)
* a place for a bit more generic, similarity based utils
The most important piece in fuzzycat is a CSV file containing
hand-picked test examples for verification, and the code that is able to
fulfill that test suite. We want to make this part more robust.
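As a rough illustration of a CSV-driven verification suite, the sketch below reads rows of paired inputs with an expected verdict and reports disagreements. The column names (`a`, `b`, `expected`), the verdict strings, and the toy `verify` function are all assumptions for this example; the real fuzzycat CSV layout and verification logic may differ:

```python
import csv
import io

# Hypothetical layout: two titles and the expected verdict per row.
CASES = io.StringIO(
    "a,b,expected\n"
    "Digital Libraries,digital libraries,match\n"
    "Digital Libraries,Analog Archives,different\n"
)


def verify(a: str, b: str) -> str:
    """Toy verifier: case-insensitive, whitespace-normalized title equality."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    return "match" if norm(a) == norm(b) else "different"


def run_cases(handle) -> list:
    """Return the rows whose verdict disagrees with the expected column."""
    return [
        row
        for row in csv.DictReader(handle)
        if verify(row["a"], row["b"]) != row["expected"]
    ]


failures = run_cases(CASES)
```

Keeping the examples in a data file means a new edge case can be added as one row, and any change to the verification code is checked against every previously collected case.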
|
review notes and some cleanup
See merge request webgroup/fuzzycat!7
|
We want to first move to elasticsearch dsl and will reactivate and
extend after refactoring.
|
fuzzycat is mostly a library; the command line tool will switch to a
bundled executable (e.g. via shiv) soon.
Removed pipenv in order to reduce confusion about which setup to use;
also, pipenv unfortunately can at times take a while to complete
operations.
|
* 'master' of git.archive.org:webgroup/fuzzycat:
  simplify README for general audience; move some content to notes
  sandcrawler slugify: lower-case greek ambiguity (OCR)
  DOI clean/normalize helper; and use in verification etc
  verify: page count parsing and comparison improvements
|
simplify README for general audience; move some content to notes
See merge request webgroup/fuzzycat!6
|
verify improvements
See merge request webgroup/fuzzycat!4