summaryrefslogtreecommitdiffstats
path: root/python/fatcat_tools/normal.py
Commit message (Collapse)AuthorAgeFilesLines
* wikidata QID normalize helperBryan Newbold2020-12-171-2/+24
|
* HACK: squash intermitent failure of detect_text_lang() testBryan Newbold2020-12-111-1/+2
| | | | | This is an open bug; it is important that tests pass on master branch however.
* langdetect: more text for 'zh' test caseBryan Newbold2020-11-201-1/+1
| | | | | | This is an attempt to fix spurious test failures, in which this text block was getting detected as 'kr' on occasion. Apparently there is non-determinism in the langdetect package.
* clean DOI: ban all non-ASCII charactersBryan Newbold2020-11-191-1/+4
| | | | | | | I believe this is safe and matches the regex filter in rust (fatcatd). Keep hitting one-off DOIs that were passing through python check, so being more strict from here forward.
* normal: handle langdetect of 'zh-cn' (not len=2)Bryan Newbold2020-11-191-0/+3
|
* handle more non-ASCII DOI casesBryan Newbold2020-11-191-1/+3
|
* more python normalizers, and move from importer commonBryan Newbold2020-11-191-0/+322
| | | | | | | | | | | | Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library). Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports.
* normalizer: filter out a specific non-ASCII character in DOIBryan Newbold2020-11-041-1/+3
|
* lint (flake8) tool python filesBryan Newbold2020-07-011-1/+0
|
* disallow a specific unicode character from DOIsBryan Newbold2020-06-261-0/+6
|
* consistently use raw string prefix for regexBryan Newbold2020-04-171-5/+5
|
* normal: DOI corner-case from pubmed importBryan Newbold2020-01-191-0/+9
|
* do not normalize "en dash" in DOIMartin Czygan2020-01-171-2/+5
| | | | | | | | | Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
* doi parsing fixesBryan Newbold2019-12-231-0/+7
| | | | | | | | | | Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI.
* normalizers: clean_pmid(), and handle nulls in all other cleanersBryan Newbold2019-12-231-0/+31
|
* handle more external identifiers in pythonBryan Newbold2019-09-181-14/+97
| | | | | This makes it possible to, eg, past an arxiv identifier or SHA-1 hash in the general search box and do a quick lookup.
* start work on 'generic' search boxBryan Newbold2019-06-131-0/+95