Commit message (Author, Age, Files, Lines)
...
* update notes (Martin Czygan, 2020-11-20, 1 file, -0/+111)
|
* verify: ignore certain types of release types for now (Martin Czygan, 2020-11-19, 1 file, -2/+4)
|
* update notes (Martin Czygan, 2020-11-19, 1 file, -1/+5)
|
* update stats (Martin Czygan, 2020-11-19, 2 files, -25/+30)
|
* verify: ignore ids like solv-int/9606010v1 for now (Martin Czygan, 2020-11-19, 1 file, -4/+8)
|
* verify: allow a larger gap (Martin Czygan, 2020-11-19, 1 file, -1/+6)
|
* verify: account for article/article-journal (Martin Czygan, 2020-11-19, 1 file, -1/+4)
|
* update verification case list (Martin Czygan, 2020-11-19, 2 files, -9/+20)
|
* update notes (Martin Czygan, 2020-11-19, 2 files, -12/+29)
|
* update notes (Martin Czygan, 2020-11-19, 1 file, -34/+43)
|
* ignore sample files (Martin Czygan, 2020-11-19, 1 file, -0/+3)
|
* update README (Martin Czygan, 2020-11-18, 1 file, -0/+58)
|
* verify: fix a None (Martin Czygan, 2020-11-18, 1 file, -2/+2)
|
* cluster: log progress (Martin Czygan, 2020-11-17, 1 file, -1/+3)
|
* cleanup sql stuff for now (Martin Czygan, 2020-11-17, 1 file, -13/+0)
|
* move blacklist to the end (Martin Czygan, 2020-11-17, 1 file, -227/+666)
|
* cleanup blacklist (Martin Czygan, 2020-11-17, 1 file, -1524/+1531)
|
* update stats (Martin Czygan, 2020-11-17, 1 file, -245/+1561)
|
* fix subtitle check (Martin Czygan, 2020-11-17, 1 file, -2/+11)
|
* extend title blacklist (Martin Czygan, 2020-11-17, 1 file, -34/+1293)
|
* update stats (Martin Czygan, 2020-11-17, 1 file, -9/+9)
|
* update blacklist (Martin Czygan, 2020-11-17, 1 file, -8/+65)
|
* update blacklist (Martin Czygan, 2020-11-17, 1 file, -4/+16)
|
* update stats (Martin Czygan, 2020-11-17, 1 file, -5/+7)
|
* update blacklist (Martin Czygan, 2020-11-17, 1 file, -12/+15)
|
* update notes (Martin Czygan, 2020-11-17, 1 file, -14/+52)
|
* update docs and blacklist (Martin Czygan, 2020-11-17, 1 file, -0/+28)
|
* update blacklists (Martin Czygan, 2020-11-17, 1 file, -2/+22)
|
* be less fine grained with datasets (Martin Czygan, 2020-11-17, 1 file, -1/+11)
|
* handle newline in titles (Martin Czygan, 2020-11-17, 1 file, -14/+10)
|
* update blacklist (Martin Czygan, 2020-11-17, 1 file, -1/+1)
|
* update blacklist (Martin Czygan, 2020-11-16, 1 file, -8/+39)
|
* add more blacklists (Martin Czygan, 2020-11-16, 1 file, -15/+32)
|
* wip: author_slug (Martin Czygan, 2020-11-15, 1 file, -2/+26)
|
* update title blacklist (Martin Czygan, 2020-11-14, 1 file, -0/+1)
|
* wip: verification and tests (Martin Czygan, 2020-11-14, 3 files, -48/+236)
|
* update Pipfile (Martin Czygan, 2020-11-14, 2 files, -50/+69)
|
* fix tests (Martin Czygan, 2020-11-13, 4 files, -55/+4)
|
* wip: verification (Martin Czygan, 2020-11-13, 3 files, -17/+181)
|     Output currently (1m sample): { "unique": 916075, "too_large": 575, "dummy": 10307, "contrib_miss": 27215, "short_title": 1379, "arxiv_v": 8943 }
|
* Merge branch 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat into bnewbold-bnewbold-sandcrawler (Martin Czygan, 2020-11-12, 6 files, -54/+761)
|\
| |     * 'bnewbold-sandcrawler' of https://github.com/bnewbold/fuzzycat:
| |       sandcrawler slugify: yet more unicode corner-cases
| |       add sandcrawler-style title key method
| |       cluster: count empty keys (and don't return them)
| |       pipenv: explicit regex dependency
| |       gitignore: add .swp (vim)
| |       make: run pytest over fuzzycat/ to catch inline tests
| |       add support for key denylist
| |
| * sandcrawler slugify: yet more unicode corner-cases (Bryan Newbold, 2020-11-10, 1 file, -16/+47)
| |
| * add sandcrawler-style title key method (Bryan Newbold, 2020-11-10, 2 files, -3/+132)
| |
| * cluster: count empty keys (and don't return them) (Bryan Newbold, 2020-11-10, 1 file, -0/+3)
| |
| * pipenv: explicit regex dependency (Bryan Newbold, 2020-11-10, 1 file, -0/+1)
| |
| |     regex, unlike stdlib 're' module, has unicode support.
| |
| |     I couldn't get pipenv to lock after adding this dependency, even though
| |     Pipfile.lock already includes regex as a sub-dependency of something else.
| |
| * gitignore: add .swp (vim) (Bryan Newbold, 2020-11-10, 1 file, -0/+4)
| |
| * make: run pytest over fuzzycat/ to catch inline tests (Bryan Newbold, 2020-11-10, 1 file, -3/+3)
| |
| * add support for key denylist (Bryan Newbold, 2020-11-10, 3 files, -4/+574)
| |
| |     This is to filter out cluster rows where the resulting key is in a given
| |     text file (one key per line).
| |
| |     The intent is to filter out records with either poor metadata, or very
| |     generic metadata, for fuzzy matching. Eg, in many cases it is better to
| |     just not try matching "Letter to the Editor" to any record. This won't
| |     always be the case; we might have journal, volume, issue, and page, which
| |     would allow a match. So this can be specified on the command line.
| |
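The denylist idea described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not fuzzycat's actual code: the function names `load_denylist` and `filter_rows`, the `(key, record)` row shape, and the sample keys are all assumptions made for the example.

```python
import io


def load_denylist(lines):
    """Build a set of keys from an iterable of lines (one key per line).

    Hypothetical helper: blank lines are skipped, surrounding whitespace is dropped.
    """
    return {line.strip() for line in lines if line.strip()}


def filter_rows(rows, denylist):
    """Yield only those (key, record) pairs whose key is not denylisted."""
    for key, record in rows:
        if key in denylist:
            continue  # too generic (or too poor) to match on the key alone
        yield key, record


# In practice the lines would come from a text file named on the command line;
# an in-memory file keeps the sketch self-contained.
denylist = load_denylist(io.StringIO("letter to the editor\neditorial\n\n"))

rows = [
    ("letter to the editor", {"ident": "a"}),
    ("a survey of fuzzy matching", {"ident": "b"}),
]
kept = list(filter_rows(rows, denylist))
```

Keeping the denylist in a plain set makes each lookup cheap, and passing it in as an argument (rather than hard-coding it) mirrors the point that filtering is optional and chosen per run.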
* | wip: note on 'serde' overhead (Martin Czygan, 2020-11-12, 1 file, -0/+1)
| |
* | reduce custom schema for now (Martin Czygan, 2020-11-12, 1 file, -13/+16)
| |
* | move fileinput.input out of the cluster (Martin Czygan, 2020-11-12, 2 files, -78/+77)
| |
| |     The cluster class should work with iterable, so testing will be easier.
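Why accepting any iterable helps testing can be shown with a small sketch. The `Cluster` class below is an illustrative stand-in, not fuzzycat's real class; its constructor arguments, the default key function, and the `run` method are assumptions for this example.

```python
from collections import defaultdict


class Cluster:
    """Hypothetical stand-in: groups lines by a key function."""

    def __init__(self, lines, key=str.lower):
        # Any iterable of strings works: a list, an open file,
        # or fileinput.input() when called from the command line.
        self.lines = lines
        self.key = key

    def run(self):
        groups = defaultdict(list)
        for line in self.lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            groups[self.key(line)].append(line)
        return dict(groups)


# A test can pass a plain list; no files or stdin are needed.
result = Cluster(["Foo\n", "foo\n", "Bar\n"]).run()
```

A command-line entry point would then be the only place that touches `fileinput.input()`, handing its result to the class unchanged.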