fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	improve lookup_license_slug helper and lookup table	Bryan Newbold	2021-11-10	1	-6/+9
*	refactor importer metadata tables into separate file; move some helpers around	Bryan Newbold	2021-11-10	1	-81/+34
	- MAX_ABSTRACT_LENGTH set in a single place (importer common)
	- merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
*	clean_doi: stop mutating double-slash DOIs, except for 10.1037 prefix	Bryan Newbold	2021-11-09	1	-1/+2
*	typing: first batch of python bulk type annotations	Bryan Newbold	2021-11-03	1	-17/+17
	While these changes are more delicate than simple lint changes, this specific batch of edits and annotations was relatively simple, and resulted in few code changes other than function signature additions.
*	lint: resolve existing mypy type errors	Bryan Newbold	2021-11-02	1	-10/+12
	Adds annotations and reworks dataflow to satisfy existing mypy issues, without adding any additional type annotations to, e.g., function signatures. There will probably be many more type errors when annotations are all added.
*	fmt (black): fatcat_tools/	Bryan Newbold	2021-11-02	1	-122/+179
*	python: isort everything	Bryan Newbold	2021-11-02	1	-2/+2
*	lint: simple, safe inline lint fixes	Bryan Newbold	2021-11-02	1	-52/+52
	'==' vs 'is'; 'not a in b' vs 'a not in b'; etc.
*	ftfy 'fix_entities' argument has been renamed	Bryan Newbold	2021-11-02	1	-4/+4
*	try some type annotations	Bryan Newbold	2021-11-02	1	-9/+10
*	python: normalization/validation support for handle identifiers (hdl)	Bryan Newbold	2021-10-13	1	-0/+33
*	clean_doi() should lower-case returned DOI	Bryan Newbold	2021-06-07	1	-1/+4
	Code in a number of places (including Pubmed importer) assumed that this was already lower-casing DOIs, resulting in some broken metadata getting created.
	See also: https://github.com/internetarchive/fatcat/issues/83
	This is just the first step of mitigation.
*	normalizer: test for un-versioned arxiv_id	Bryan Newbold	2020-12-24	1	-0/+4
*	wikidata QID normalize helper	Bryan Newbold	2020-12-17	1	-2/+24
*	HACK: squash intermittent failure of detect_text_lang() test	Bryan Newbold	2020-12-11	1	-1/+2
	This is an open bug; however, it is important that tests pass on the master branch.
*	langdetect: more text for 'zh' test case	Bryan Newbold	2020-11-20	1	-1/+1
	This is an attempt to fix spurious test failures, in which this text block was getting detected as 'kr' on occasion. Apparently there is non-determinism in the langdetect package.
*	clean DOI: ban all non-ASCII characters	Bryan Newbold	2020-11-19	1	-1/+4
	I believe this is safe and matches the regex filter in rust (fatcatd). Keep hitting one-off DOIs that were passing through python check, so being more strict from here forward.
*	normal: handle langdetect of 'zh-cn' (not len=2)	Bryan Newbold	2020-11-19	1	-0/+3
*	handle more non-ASCII DOI cases	Bryan Newbold	2020-11-19	1	-1/+3
*	more python normalizers, and move from importer common	Bryan Newbold	2020-11-19	1	-0/+322
	Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library). Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports.
*	normalizer: filter out a specific non-ASCII character in DOI	Bryan Newbold	2020-11-04	1	-1/+3
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	1	-1/+0
*	disallow a specific unicode character from DOIs	Bryan Newbold	2020-06-26	1	-0/+6
*	consistently use raw string prefix for regex	Bryan Newbold	2020-04-17	1	-5/+5
*	normal: DOI corner-case from pubmed import	Bryan Newbold	2020-01-19	1	-0/+9
*	do not normalize "en dash" in DOI	Martin Czygan	2020-01-17	1	-2/+5
	Technically, [...] DOI names may incorporate any printable characters from the Universal Character Set (UCS-2), of ISO/IEC 10646, which is the character set defined by Unicode (https://www.doi.org/doi_handbook/2_Numbering.html#2.5.1). For mostly QA reasons, we currently treat a DOI with an "en dash" as invalid.
*	doi parsing fixes	Bryan Newbold	2019-12-23	1	-0/+7
	Replace emdash with regular dash. Replace double slash after partner ID with single slash. This conversion seems to be done by crossref automatically on lookup. I tried several examples, using doi.org resolver and Crossref API lookup. Note that there are a number of fatcat entities with '//' in the DOI.
*	normalizers: clean_pmid(), and handle nulls in all other cleaners	Bryan Newbold	2019-12-23	1	-0/+31
*	handle more external identifiers in python	Bryan Newbold	2019-09-18	1	-14/+97
	This makes it possible to, e.g., paste an arxiv identifier or SHA-1 hash in the general search box and do a quick lookup.
*	start work on 'generic' search box	Bryan Newbold	2019-06-13	1	-0/+95
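
Several of the commits above describe how clean_doi() behaves after successive fixes: null inputs are handled, the result is lower-cased, any non-ASCII character (including an en dash) makes the DOI invalid, and a double slash is rewritten to a single slash only for the 10.1037 prefix. The following is a minimal sketch of those rules taken together, not the actual fatcat code: the name clean_doi_sketch and the DOI_PATTERN regex are hypothetical and chosen only for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption): the real regex used by fatcat is not
# shown in this log.
DOI_PATTERN = re.compile(r"10\.\d{3,6}/\S+")


def clean_doi_sketch(raw: Optional[str]) -> Optional[str]:
    """Return a normalized DOI, or None if the input does not look valid."""
    # Null or empty input: nothing to clean.
    if not raw:
        return None
    doi = raw.strip().lower()
    # Ban all non-ASCII characters; this also rejects en dash and em dash.
    if not doi.isascii():
        return None
    # Double slashes are left alone, except under the 10.1037 prefix.
    if doi.startswith("10.1037//"):
        doi = "10.1037/" + doi[len("10.1037//"):]
    if not DOI_PATTERN.fullmatch(doi):
        return None
    return doi


print(clean_doi_sketch("10.1234/ABC.Def"))
print(clean_doi_sketch("10.1037//0021-843X.106.2.280"))
print(clean_doi_sketch("10.1234//abc"))
print(clean_doi_sketch("10.1234/abc\u2013def"))
```

Under these assumptions, the first call returns a lower-cased DOI, the second collapses the double slash, the third keeps its double slash, and the fourth returns None because of the en dash.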