fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	wikidata QID normalize helper	Bryan Newbold	2020-12-17	1	-2/+24
*	HACK: squash intermittent failure of detect_text_lang() test	Bryan Newbold	2020-12-11	1	-1/+2
	This is an open bug; however, it is important that tests pass on the master branch.
*	langdetect: more text for 'zh' test case	Bryan Newbold	2020-11-20	1	-1/+1
	An attempt to fix spurious test failures in which this text block was occasionally detected as 'kr'. The langdetect package is apparently non-deterministic.
*	clean DOI: ban all non-ASCII characters	Bryan Newbold	2020-11-19	1	-1/+4
	I believe this is safe and matches the regex filter in Rust (fatcatd). We keep hitting one-off DOIs that pass the Python check, so the check is stricter from here forward.
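	The stricter check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `clean_doi_sketch`, the regex, and the lowercasing step are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed pattern (not copied from fatcat): "10.", a registrant code of
# 3 to 6 digits, a slash, then a non-empty suffix without whitespace.
DOI_REGEX = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Return a normalized DOI, or None if the input is rejected."""
    if not raw:
        return None
    doi = raw.strip().lower()  # lowercasing is an assumption of this sketch
    # Reject, rather than repair, anything containing non-ASCII characters.
    if not doi.isascii():
        return None
    if not DOI_REGEX.fullmatch(doi):
        return None
    return doi
```

	With this sketch, a DOI containing an en dash (U+2013) is rejected outright instead of being passed through.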
*	normal: handle langdetect of 'zh-cn' (not len=2)	Bryan Newbold	2020-11-19	1	-0/+3
*	handle more non-ASCII DOI cases	Bryan Newbold	2020-11-19	1	-1/+3
*	more python normalizers, and move from importer common	Bryan Newbold	2020-11-19	1	-0/+322
	Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied the language name and country name parser helpers from the chocula repository (built on the existing pycountry helper library). Other importers have not yet been refactored to point to these helpers; that should be a separate PR once this branch is merged. The current changes are backwards compatible via re-imports.
*	normalizer: filter out a specific non-ASCII character in DOI	Bryan Newbold	2020-11-04	1	-1/+3
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	1	-1/+0
*	disallow a specific unicode character from DOIs	Bryan Newbold	2020-06-26	1	-0/+6
*	consistently use raw string prefix for regex	Bryan Newbold	2020-04-17	1	-5/+5
*	normal: DOI corner-case from pubmed import	Bryan Newbold	2020-01-19	1	-0/+9
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-2/+5
	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
*	doi parsing fixes	Bryan Newbold	2019-12-23	1	-0/+7
	Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by Crossref automatically on lookup. I tried several examples, using the doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI.
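	The two repairs described above can be sketched as follows. This is an illustrative sketch only: `repair_doi_sketch` is a hypothetical name, and reading "double slash after partner ID" as the first slash in the string is an assumption of the example.

```python
def repair_doi_sketch(raw: str) -> str:
    """Apply the two repairs from the commit message to a raw DOI string."""
    doi = raw.strip()
    # Replace em dash (U+2014) with a regular ASCII hyphen.
    doi = doi.replace("\u2014", "-")
    # Collapse a double slash directly after the prefix ("10.NNNN//...")
    # into a single slash; slashes later in the suffix are left alone.
    prefix, sep, suffix = doi.partition("/")
    if sep and suffix.startswith("/"):
        doi = prefix + "/" + suffix[1:]
    return doi
```

	For example, `repair_doi_sketch("10.1234//abc")` returns `"10.1234/abc"`, while a double slash further into the suffix is preserved.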
*	normalizers: clean_pmid(), and handle nulls in all other cleaners	Bryan Newbold	2019-12-23	1	-0/+31
*	handle more external identifiers in python	Bryan Newbold	2019-09-18	1	-14/+97
	This makes it possible to, e.g., paste an arXiv identifier or SHA-1 hash into the general search box and do a quick lookup.
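	A rough sketch of how a search box might recognize such identifiers before falling back to full-text search. The function name `sniff_identifier` and both patterns are assumptions for illustration (new-style arXiv identifiers only, and 40 hex digits for SHA-1); they are not taken from the fatcat code.

```python
import re
from typing import Optional, Tuple

# Assumed patterns, for illustration only.
ARXIV_RE = re.compile(r"(?:arxiv:)?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)
SHA1_RE = re.compile(r"[0-9a-f]{40}", re.IGNORECASE)


def sniff_identifier(query: str) -> Optional[Tuple[str, str]]:
    """Return (kind, normalized value) if the query looks like an identifier."""
    q = query.strip()
    m = ARXIV_RE.fullmatch(q)
    if m:
        # Drop any "arXiv:" prefix, keep the optional version suffix.
        return ("arxiv", m.group(1))
    if SHA1_RE.fullmatch(q):
        return ("sha1", q.lower())
    # Not a recognized identifier; the caller would run a normal search.
    return None
```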
*	start work on 'generic' search box	Bryan Newbold	2019-06-13	1	-0/+95