| Commit message (Collapse) | Author | Age | Files | Lines |
| |
We keep the name, since the API, `matcher.match(release)`, is the same.
Changes: simplified queries; at most one query is performed against
Elasticsearch; parallel release retrieval from the API; optional support
for release year windows.
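A release year window can be expressed as an Elasticsearch `range` clause. The following is a minimal sketch, not the code from this commit: the helper name `release_year_range` is hypothetical, and the index field name `release_year` is an assumption.

```python
from typing import Optional


def release_year_range(release_year: Optional[int], padding: int = 0) -> Optional[dict]:
    """Build a `range` clause covering release_year +/- padding.

    Returns None when the release carries no year, so that the caller
    can leave the filter out of the query altogether.
    """
    if release_year is None:
        return None
    return {
        "range": {
            # "release_year" is an assumed field name in the search index.
            "release_year": {
                "gte": release_year - padding,
                "lte": release_year + padding,
            }
        }
    }
```

With `padding=1`, a release from 2000 would be matched against candidates from 1999 through 2001.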
Test cases are expressed in YAML and are auto-loaded from the specified
directory. Tests run against the current search endpoint, which means the
actual output may change on index updates. For the moment, we think this
setup is relatively simple and not too unstable.
about: title contrib, partial name
input: >
  {
    "contribs": [
      {
        "raw_name": "Adams"
      }
    ],
    "title": "digital libraries",
    "ext_ids": {}
  }
release_year_padding: 1
expected:
  - 7rmvqtrb2jdyhcxxodihzzcugy
  - a2u6ougtsjcbvczou6sazsulcm
  - dy45vilej5diros6zmax46nm4e
  - exuwhhayird4fdjmmsiqpponlq
  - gqrj7jikezgcfpjfazhpf4e7c4
  - mkmqt3453relbpuyktnmsg6hjq
  - t2g5sl3dgzchtnq7dynxyzje44
  - t4tvenhrvzamraxrvvxivxmvga
  - wd3oeoi3bffknfbg2ymleqc4ja
  - y63a6dhrfnb7bltlxfynydbojy
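A test case of this shape could be checked roughly as follows. This is a sketch under assumptions: the YAML is taken to be already parsed into a dict, the matcher is passed a plain dict rather than a release entity, results are assumed to carry an `ident` attribute, and `check_case` and `StubMatcher` are hypothetical names. The expected list is shortened for illustration.

```python
import json
from types import SimpleNamespace


def check_case(matcher, case: dict) -> bool:
    """Return True if the matcher yields exactly the expected identifiers."""
    release = json.loads(case["input"])  # the "input" key holds a JSON document
    got = sorted(r.ident for r in matcher.match(release))
    return got == sorted(case.get("expected", []))


class StubMatcher:
    """Stand-in for a real matcher; returns canned results, no network access."""

    def __init__(self, idents):
        self.idents = idents

    def match(self, release):
        return [SimpleNamespace(ident=i) for i in self.idents]


case = {
    "about": "title contrib, partial name",
    "input": '{"contribs": [{"raw_name": "Adams"}], '
             '"title": "digital libraries", "ext_ids": {}}',
    # Shortened; the real case lists ten identifiers.
    "expected": ["7rmvqtrb2jdyhcxxodihzzcugy", "a2u6ougtsjcbvczou6sazsulcm"],
}
```

Comparing sorted identifier lists keeps the check independent of result ordering, which may vary between index updates.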
| |
Instead, use `FuzzyReleaseMatcher.match`, which has approximately the
same behavior.
| |
The goal of this refactoring was to make the matching process a bit more
configurable by using a class and a cascade of queries.
For a limited test set, `FuzzyReleaseMatcher.match` works the same as
`match_release_fuzzy`.
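The idea of a class driving a cascade of queries can be sketched as below. This is an illustration only, not the `FuzzyReleaseMatcher` implementation: `CascadeMatcher`, `by_ext_id` and `by_title` are hypothetical names, and the query functions return canned values instead of querying a search index.

```python
from typing import Callable, Iterable, List


class CascadeMatcher:
    """Try query functions in order and return the first non-empty result."""

    def __init__(self, queries: Iterable[Callable[[dict], List[str]]]):
        self.queries = list(queries)

    def match(self, release: dict) -> List[str]:
        for query in self.queries:
            hits = query(release)
            if hits:
                # An earlier, more specific query wins; later ones are skipped.
                return hits
        return []


def by_ext_id(release: dict) -> List[str]:
    """Hypothetical exact lookup by DOI; most specific, so it goes first."""
    doi = release.get("ext_ids", {}).get("doi")
    return ["id-for-" + doi] if doi else []


def by_title(release: dict) -> List[str]:
    """Hypothetical fallback on the title."""
    return ["title-hit"] if release.get("title") else []


matcher = CascadeMatcher([by_ext_id, by_title])
```

Ordering the cascade from most to least specific means a cheap, precise lookup can short-circuit the broader queries, and the list of queries becomes the main configuration point.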
| |
Background: verifying hundreds of millions of documents turned out to be
a bit slow. Anecdata: running clustering and verification over 1.8B
inputs took over 50h, whereas the Go port (skate) required about 2-4h for
those operations. Also, with Go we do not need the extra GNU parallel
wrapping.

In any case, we aim for the fuzzycat refactoring to provide:

* better, more configurable verification and small-scale matching
* removal of batch clustering code (and improved refcat docs)
* a place for somewhat more generic, similarity-based utils

The most important piece in fuzzycat is a CSV file containing hand-picked
test examples for verification, and the code that is able to
fulfill that test suite. We want to make this part more robust.
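A CSV-driven verification test might look roughly like this. The column names (`a`, `b`, `status`), the status labels and `toy_verify` are assumptions made for illustration; the actual CSV layout and verification function are not shown in this log.

```python
import csv
import io


def run_cases(fileobj, verify):
    """Yield (row, actual) for each row where verify disagrees with the CSV."""
    for row in csv.DictReader(fileobj):
        actual = verify(row["a"], row["b"])
        if actual != row["status"]:
            yield row, actual


def toy_verify(a: str, b: str) -> str:
    """Toy stand-in for a real verification function."""
    return "EXACT" if a == b else "DIFFERENT"


# Inline sample; a real suite would open the hand-picked CSV file instead.
SAMPLE = "a,b,status\nx1,x1,EXACT\nx1,y2,DIFFERENT\nx1,z3,EXACT\n"

failures = list(run_cases(io.StringIO(SAMPLE), toy_verify))
```

Collecting every disagreement, rather than stopping at the first, makes it easier to see how a change in the verification logic shifts results across the whole example set.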
| |
We want to first move to elasticsearch dsl and will reactivate and
extend this after refactoring.
| | |
* 'master' of git.archive.org:webgroup/fuzzycat:
  simplify README for general audience; move some content to notes
  sandcrawler slugify: lower-case greek ambiguity (OCR)
  DOI clean/normalize helper; and use in verification etc
  verify: page count parsing and comparison improvements
| | |
refactoring/upstreaming fuzzycat "live" matching helpers
See merge request webgroup/fuzzycat!2
| | |
Some of these are a little redundant, in that calling code could
trivially re-implement them. However, I think these are good starters for
a stable external API, leaving us room to iterate and refactor
lower-level implementations behind the scenes.
| | |
* bnewbold-dev-setup:
  dynaconf: switch to fuzzycat.config import across project
  upgrade to python3.8
  gitlab CI: try 'make deps' and 'make test'
  makefile: run common commands inside pipenv
  makefile: change 'deps' to be simple --dev --deploy
  make fmt
| | |
This is the recommended way to use dynaconf.
| |
* https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html
| |
Will compress intermediate results with zstd (https://git.io/Jt00y9).
| |
different, but related; verify says: "strong"
| |
same DOI, but repeated slash
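A case like this can be handled by normalizing DOIs before comparison. The sketch below is not the fuzzycat helper; `normalize_doi` is a hypothetical name, and collapsing repeated slashes is an assumption that could, in rare cases, alter a DOI whose suffix legitimately contains `//`.

```python
import re
from typing import Optional

# Common resolver and scheme prefixes, matched after lower-casing.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: Optional[str]) -> Optional[str]:
    """Lower-case, strip a known prefix and collapse repeated slashes.

    Returns None if the result does not look like a DOI.
    """
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Collapse runs of slashes only after the URL prefix is gone.
    doi = re.sub(r"/{2,}", "/", doi)
    if not doi.startswith("10.") or "/" not in doi:
        return None
    return doi
```

Two records whose DOIs differ only by case, resolver prefix or a doubled slash then compare equal after normalization.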
| |
unfortunately, the metadata is incomplete with respect to page count
(e.g. "29" in the metadata, but "29-45" on the publisher site:
https://academic.oup.com/restud/article-abstract/41/5/29/1522050)
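One way to avoid a false mismatch in such a case is to compare page counts only when both sides give a full range. This is a sketch with hypothetical helper names (`page_count`, `page_counts_conflict`), not the verification code itself.

```python
import re
from typing import Optional


def page_count(pages: Optional[str]) -> Optional[int]:
    """Return the number of pages for a "first-last" range, else None.

    A lone start page such as "29" carries no count information, so it
    yields None rather than 1.
    """
    if not pages:
        return None
    m = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", pages)
    if not m:
        return None
    first, last = int(m.group(1)), int(m.group(2))
    if last < first:
        # Abbreviated ranges like "123-30" are not handled in this sketch.
        return None
    return last - first + 1


def page_counts_conflict(a: Optional[str], b: Optional[str]) -> bool:
    """True only if both sides give a count and the counts differ."""
    ca, cb = page_count(a), page_count(b)
    return ca is not None and cb is not None and ca != cb
```

Under this rule, "29" against "29-45" is treated as inconclusive instead of as evidence that the two records differ.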
| |
currently, status is STRONG; having article and comments attached to a
single work item might be useful