Commit message | Author | Age | Files | Lines | ||
---|---|---|---|---|---|---|
... | ||||||
* | verify: allow a larger gap | Martin Czygan | 2020-11-19 | 1 | -1/+6 | |
| | ||||||
* | verify: account for article/article-journal | Martin Czygan | 2020-11-19 | 1 | -1/+4 | |
| | ||||||
* | update verification case list | Martin Czygan | 2020-11-19 | 2 | -9/+20 | |
| | ||||||
* | update notes | Martin Czygan | 2020-11-19 | 2 | -12/+29 | |
| | ||||||
* | update notes | Martin Czygan | 2020-11-19 | 1 | -34/+43 | |
| | ||||||
* | ignore sample files | Martin Czygan | 2020-11-19 | 1 | -0/+3 | |
| | ||||||
* | update README | Martin Czygan | 2020-11-18 | 1 | -0/+58 | |
| | ||||||
* | verify: fix a None | Martin Czygan | 2020-11-18 | 1 | -2/+2 | |
| | ||||||
* | cluster: log progress | Martin Czygan | 2020-11-17 | 1 | -1/+3 | |
| | ||||||
* | cleanup sql stuff for now | Martin Czygan | 2020-11-17 | 1 | -13/+0 | |
| | ||||||
* | move blacklist to the end | Martin Czygan | 2020-11-17 | 1 | -227/+666 | |
| | ||||||
* | cleanup blacklist | Martin Czygan | 2020-11-17 | 1 | -1524/+1531 | |
| | ||||||
* | update stats | Martin Czygan | 2020-11-17 | 1 | -245/+1561 | |
| | ||||||
* | fix subtitle check | Martin Czygan | 2020-11-17 | 1 | -2/+11 | |
| | ||||||
* | extend title blacklist | Martin Czygan | 2020-11-17 | 1 | -34/+1293 | |
| | ||||||
* | update stats | Martin Czygan | 2020-11-17 | 1 | -9/+9 | |
| | ||||||
* | update blacklist | Martin Czygan | 2020-11-17 | 1 | -8/+65 | |
| | ||||||
* | update blacklist | Martin Czygan | 2020-11-17 | 1 | -4/+16 | |
| | ||||||
* | update stats | Martin Czygan | 2020-11-17 | 1 | -5/+7 | |
| | ||||||
* | update blacklist | Martin Czygan | 2020-11-17 | 1 | -12/+15 | |
| | ||||||
* | update notes | Martin Czygan | 2020-11-17 | 1 | -14/+52 | |
| | ||||||
* | update docs and blacklist | Martin Czygan | 2020-11-17 | 1 | -0/+28 | |
| | ||||||
* | update blacklists | Martin Czygan | 2020-11-17 | 1 | -2/+22 | |
| | ||||||
* | be less fine grained with datasets | Martin Czygan | 2020-11-17 | 1 | -1/+11 | |
| | ||||||
* | handle newline in titles | Martin Czygan | 2020-11-17 | 1 | -14/+10 | |
| | ||||||
* | update blacklist | Martin Czygan | 2020-11-17 | 1 | -1/+1 | |
| | ||||||
* | update blacklist | Martin Czygan | 2020-11-16 | 1 | -8/+39 | |
| | ||||||
* | add more blacklists | Martin Czygan | 2020-11-16 | 1 | -15/+32 | |
| | ||||||
* | wip: author_slug | Martin Czygan | 2020-11-15 | 1 | -2/+26 | |
| | ||||||
* | update title blacklist | Martin Czygan | 2020-11-14 | 1 | -0/+1 | |
| | ||||||
* | wip: verification and tests | Martin Czygan | 2020-11-14 | 3 | -48/+236 | |
| | ||||||
* | update Pipfile | Martin Czygan | 2020-11-14 | 2 | -50/+69 | |
| | ||||||
* | fix tests | Martin Czygan | 2020-11-13 | 4 | -55/+4 | |
| | ||||||
* | wip: verification | Martin Czygan | 2020-11-13 | 3 | -17/+181 | |
| | Output currently (1m sample): { "unique": 916075, "too_large": 575, "dummy": 10307, "contrib_miss": 27215, "short_title": 1379, "arxiv_v": 8943 } | |||||
* | Merge branch 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat into bnewbold-bnewbold-sandcrawler | Martin Czygan | 2020-11-12 | 6 | -54/+761 | |
|\ | * 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat: sandcrawler slugify: yet more unicode corner-cases; add sandcrawler-style title key method; cluster: count empty keys (and don't return them); pipenv: explicit regex dependency; gitignore: add .swp (vim); make: run pytest over fuzzycat/ to catch inline tests; add support for key denylist | |||||
| * | sandcrawler slugify: yet more unicode corner-cases | Bryan Newbold | 2020-11-10 | 1 | -16/+47 | |
| | | ||||||
| * | add sandcrawler-style title key method | Bryan Newbold | 2020-11-10 | 2 | -3/+132 | |
| | | ||||||
| * | cluster: count empty keys (and don't return them) | Bryan Newbold | 2020-11-10 | 1 | -0/+3 | |
| | | ||||||
| * | pipenv: explicit regex dependency | Bryan Newbold | 2020-11-10 | 1 | -0/+1 | |
| | | regex, unlike stdlib 're' module, has unicode support. I couldn't get pipenv to lock after adding this dependency, even though Pipfile.lock already includes regex as a sub-dependency of something else. | |||||
| * | gitignore: add .swp (vim) | Bryan Newbold | 2020-11-10 | 1 | -0/+4 | |
| | | ||||||
| * | make: run pytest over fuzzycat/ to catch inline tests | Bryan Newbold | 2020-11-10 | 1 | -3/+3 | |
| | | ||||||
| * | add support for key denylist | Bryan Newbold | 2020-11-10 | 3 | -4/+574 | |
| | | This is to filter out cluster rows where the resulting key is in a given text file (one key per line). The intent is to filter out records with either poor metadata, or very generic metadata, for fuzzy matching. Eg, in many cases it is better to just not try matching "Letter to the Editor" to any record. This won't always be the case; we might have journal, volume, issue, and page, which would allow a match. So this can be specified on the command line. | |||||
* | | wip: note on 'serde' overhead | Martin Czygan | 2020-11-12 | 1 | -0/+1 | |
| | | ||||||
* | | reduce custom schema for now | Martin Czygan | 2020-11-12 | 1 | -13/+16 | |
| | | ||||||
* | | move fileinput.input out of the cluster | Martin Czygan | 2020-11-12 | 2 | -78/+77 | |
| | | The cluster class should work with iterable, so testing will be easier. | |||||
* | | move main.py to __main__.py | Martin Czygan | 2020-11-12 | 1 | -0/+0 | |
| | | ||||||
* | | update deps | Martin Czygan | 2020-11-12 | 5 | -149/+148 | |
| | | ||||||
* | | reduce dependencies | Martin Czygan | 2020-11-12 | 2 | -19/+14 | |
|/ | * removed pydantic, orjson | |||||
* | cluster notes | Martin Czygan | 2020-11-11 | 2 | -0/+73 | |
| | ||||||
* | verify stub | Martin Czygan | 2020-11-11 | 2 | -21/+8 | |
| |
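
The "add support for key denylist" and "cluster: count empty keys (and don't return them)" messages above describe dropping cluster rows whose key appears in a text file (one key per line) or is empty. The following is a minimal sketch of that idea, not fuzzycat's actual code: the names `load_denylist` and `filter_keys`, the `(key, ident)` row shape, and the sample keys are all hypothetical and chosen for illustration only.

```python
import io
from typing import Iterable, Iterator, Set, Tuple


def load_denylist(lines: Iterable[str]) -> Set[str]:
    """Read one key per line; blank lines are skipped (hypothetical helper)."""
    return {line.strip() for line in lines if line.strip()}


def filter_keys(
    rows: Iterable[Tuple[str, str]], denylist: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (key, ident) rows whose key is non-empty and not denylisted."""
    for key, ident in rows:
        if not key or key in denylist:
            # Too generic (or missing) to be useful for fuzzy matching.
            continue
        yield key, ident


# Assumed sample data: a denylist file with two generic title keys.
denylist = load_denylist(io.StringIO("lettertotheeditor\neditorial\n"))
rows = [
    ("lettertotheeditor", "r1"),
    ("fuzzymatchingatscale", "r2"),
    ("", "r3"),
]
kept = list(filter_keys(rows, denylist))
```

Because `filter_keys` accepts any iterable rather than reading from a file itself, it can be exercised with an in-memory list, which is in the spirit of the "move fileinput.input out of the cluster" message.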