path: root/python/fatcat_tools/normal.py
Commit message | Author | Age | Files | Lines
* improve lookup_license_slug helper and lookup table | Bryan Newbold | 2021-11-10 | 1 | -6/+9
* refactor importer metadata tables into separate file; move some helpers around | Bryan Newbold | 2021-11-10 | 1 | -81/+34
    - MAX_ABSTRACT_LENGTH set in a single place (importer common)
    - merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
* clean_doi: stop mutating double-slash DOIs, except for 10.1037 prefix | Bryan Newbold | 2021-11-09 | 1 | -1/+2
* typing: first batch of python bulk type annotations | Bryan Newbold | 2021-11-03 | 1 | -17/+17
    While these changes are more delicate than simple lint changes, this specific batch of edits and annotations was *relatively* simple, and resulted in few code changes other than function signature additions.
* lint: resolve existing mypy type errors | Bryan Newbold | 2021-11-02 | 1 | -10/+12
    Adds annotations and reworks dataflow to satisfy existing mypy issues, without adding any additional type annotations to, e.g., function signatures. There will probably be many more type errors when annotations are all added.
* fmt (black): fatcat_tools/ | Bryan Newbold | 2021-11-02 | 1 | -122/+179
* python: isort everything | Bryan Newbold | 2021-11-02 | 1 | -2/+2
* lint: simple, safe inline lint fixes | Bryan Newbold | 2021-11-02 | 1 | -52/+52
    '==' vs 'is'; 'not a in b' vs 'a not in b'; etc.
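The kinds of lint fixes named in this commit can be illustrated with a short sketch. The commit message does not say which direction each '==' / 'is' change went, so the functions below are hypothetical examples of the usual flake8-preferred forms, not code taken from normal.py.

```python
def is_missing(value):
    # Compare against the None singleton by identity ('is'), not equality ('==').
    return value is None


def is_english(lang):
    # Compare string values with '==', not identity; 'is' on strings is unreliable.
    return lang == "en"


def lacks_key(key, mapping):
    # Preferred spelling: 'key not in mapping' rather than 'not key in mapping'.
    return key not in mapping
```

Both spellings of the membership test behave identically; the lint rule is about readability only.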
* ftfy 'fix_entities' argument has been renamed | Bryan Newbold | 2021-11-02 | 1 | -4/+4
* try some type annotations | Bryan Newbold | 2021-11-02 | 1 | -9/+10
* python: normalization/validation support for handle identifiers (hdl) | Bryan Newbold | 2021-10-13 | 1 | -0/+33
* clean_doi() should lower-case returned DOI | Bryan Newbold | 2021-06-07 | 1 | -1/+4
    Code in a number of places (including Pubmed importer) assumed that this was already lower-casing DOIs, resulting in some broken metadata getting created.
    See also: https://github.com/internetarchive/fatcat/issues/83
    This is just the first step of mitigation.
* normalizer: test for un-versioned arxiv_id | Bryan Newbold | 2020-12-24 | 1 | -0/+4
* wikidata QID normalize helper | Bryan Newbold | 2020-12-17 | 1 | -2/+24
* HACK: squash intermittent failure of detect_text_lang() test | Bryan Newbold | 2020-12-11 | 1 | -1/+2
    This is an open bug; however, it is important that tests pass on the master branch.
* langdetect: more text for 'zh' test case | Bryan Newbold | 2020-11-20 | 1 | -1/+1
    This is an attempt to fix spurious test failures, in which this text block was getting detected as 'kr' on occasion. Apparently there is non-determinism in the langdetect package.
* clean DOI: ban all non-ASCII characters | Bryan Newbold | 2020-11-19 | 1 | -1/+4
    I believe this is safe and matches the regex filter in rust (fatcatd). Keep hitting one-off DOIs that were passing through python check, so being more strict from here forward.
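A minimal sketch of the stricter behavior this log describes (lower-casing the result and rejecting any non-ASCII character) might look like the following. The function name `sketch_clean_doi` and the `DOI_PATTERN` regex are assumptions made for illustration; they are not the actual `clean_doi()` in fatcat_tools/normal.py, nor the regex used by fatcatd.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not copied from fatcat):
# "10.", a numeric registrant code, "/", then a non-whitespace suffix.
DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")


def sketch_clean_doi(raw: Optional[str]) -> Optional[str]:
    """Return a lower-cased, ASCII-only DOI, or None if the input is rejected."""
    if not raw:
        return None
    doi = raw.strip().lower()
    # Reject anything outside ASCII (en dashes, other stray Unicode characters).
    if not doi.isascii():
        return None
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```

Rejecting rather than repairing non-ASCII input keeps the check simple, at the cost of dropping the rare DOI that legitimately contains such characters.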
* normal: handle langdetect of 'zh-cn' (not len=2) | Bryan Newbold | 2020-11-19 | 1 | -0/+3
* handle more non-ASCII DOI cases | Bryan Newbold | 2020-11-19 | 1 | -1/+3
* more python normalizers, and move from importer common | Bryan Newbold | 2020-11-19 | 1 | -0/+322
    Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal.
    Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library).
    Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged.
    Current changes are backwards compatible via re-imports.
* normalizer: filter out a specific non-ASCII character in DOI | Bryan Newbold | 2020-11-04 | 1 | -1/+3
* lint (flake8) tool python files | Bryan Newbold | 2020-07-01 | 1 | -1/+0
* disallow a specific unicode character from DOIs | Bryan Newbold | 2020-06-26 | 1 | -0/+6
* consistently use raw string prefix for regex | Bryan Newbold | 2020-04-17 | 1 | -5/+5
* normal: DOI corner-case from pubmed import | Bryan Newbold | 2020-01-19 | 1 | -0/+9
* do not normalize "en dash" in DOI | Martin Czygan | 2020-01-17 | 1 | -2/+5
    Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1).
    For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
* doi parsing fixes | Bryan Newbold | 2019-12-23 | 1 | -0/+7
    Replace emdash with regular dash.
    Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup.
    Note that there are a number of fatcat entities with '//' in the DOI.
* normalizers: clean_pmid(), and handle nulls in all other cleaners | Bryan Newbold | 2019-12-23 | 1 | -0/+31
* handle more external identifiers in python | Bryan Newbold | 2019-09-18 | 1 | -14/+97
    This makes it possible to, e.g., paste an arXiv identifier or SHA-1 hash into the general search box and do a quick lookup.
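The search-box lookup described above amounts to guessing an identifier type from a pasted string. A rough sketch, assuming simple regular expressions for each identifier family, is given here; the function name `guess_identifier_type` and all four patterns are hypothetical and are not taken from fatcat_tools.

```python
import re
from typing import Optional

# All patterns below are illustrative assumptions, not fatcat's own.
DOI_LIKE = re.compile(r"^10\.\d{3,6}/\S+$")
ARXIV_NEW = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")                # e.g. 1706.03762v5
ARXIV_OLD = re.compile(r"^[a-z\-]+(\.[a-z]{2})?/\d{7}(v\d+)?$")   # e.g. hep-th/9901001
SHA1_HEX = re.compile(r"^[0-9a-f]{40}$")


def guess_identifier_type(query: str) -> Optional[str]:
    """Return 'doi', 'arxiv', or 'sha1' if the query looks like one, else None."""
    q = query.strip().lower()
    if DOI_LIKE.match(q):
        return "doi"
    if ARXIV_NEW.match(q) or ARXIV_OLD.match(q):
        return "arxiv"
    if SHA1_HEX.match(q):
        return "sha1"
    return None
```

A query that matches none of the patterns returns None and would presumably fall through to an ordinary full-text search.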
* start work on 'generic' search box | Bryan Newbold | 2019-06-13 | 1 | -0/+95