fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	Merge branch 'bnewbold-import-refactors' into 'master'	bnewbold	2021-11-11	1	-313/+11
\|\	import refactors and deprecations
\| \|	Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes. The Datacite-specific stuff could use review here.
\| \|	Remove unused/deprecated/dead code:
\| \|	- cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
\| \|	- "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)
\| \|	Refactors:
\| \|	- moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
\| \|	- shuffled around relative imports and some function names ("clean_str" vs. "clean")
\| \|	Some actual behavioral changes:
\| \|	- remove some Datacite-specific license slugs
\| \|	- stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
\| \|	- remove some excess metadata from datacite 'extra' fields
\| *	refactor importer metadata tables into separate file; move some helpers around	Bryan Newbold	2021-11-10	1	-314/+5
\| \|	- MAX_ABSTRACT_LENGTH set in a single place (importer common)
\| \|	- merge datacite license slug table in to common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
\| *	importers: refactor imports of clean() and other normalization helpers	Bryan Newbold	2021-11-10	1	-4/+11
\| \|
* \|	pubmed: allow updates if PMCID does not exist yet	Bryan Newbold	2021-11-10	1	-1/+6
\|/	The intent of this change is to start updating Pubmed metadata records when a PMCID has been assigned, but that ext_id hasn't been recorded in fatcat yet. It is likely that this change will result in some additional duplicate PMCIDs in the catalog. But the principle is that the PMID is the primary pubmed identifier, and all records with a PMID should have the PMCID that pubmed indicates, even if there exists another incorrect record.
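The rule described in this commit message can be sketched roughly as below. This is only an illustration under assumptions: the helper name `should_update_release` and its two arguments are hypothetical and are not taken from the importer itself.

```python
from typing import Optional


def should_update_release(
    existing_pmcid: Optional[str], pubmed_pmcid: Optional[str]
) -> bool:
    """Hypothetical helper: decide whether an existing release gets updated.

    Update only when Pubmed reports a PMCID and the existing fatcat
    record does not have one recorded yet.
    """
    return bool(pubmed_pmcid) and not existing_pmcid
```

A record that already carries a PMCID is left alone under this sketch, which matches the "does not exist yet" wording of the commit title.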
*	typing: relatively simple type check fixes	Bryan Newbold	2021-11-03	1	-17/+7
\|	These mostly add new variable names so that existing variables aren't overwritten with a new type; delay coercing '{}' or '[]' to 'None' until the last minute; adding is-not-None checks to conditional clauses; and similar small changes.
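The three kinds of fix listed in this message might look something like the following. This is a hedged sketch, not code from the importer: `parse_year` and `build_extra` are hypothetical names invented for illustration.

```python
from typing import Any, Dict, List, Optional


def parse_year(raw: str) -> Optional[int]:
    # Bind the converted value to a new name instead of rebinding `raw`
    # (a str) to an int, which a type checker would flag.
    stripped = raw.strip()
    year: Optional[int] = int(stripped) if stripped.isdigit() else None
    return year


def build_extra(lang: Optional[str], keywords: List[str]) -> Optional[Dict[str, Any]]:
    # Keep `extra` a dict for the whole function body...
    extra: Dict[str, Any] = {}
    # Explicit is-not-None check, so the checker can narrow Optional[str].
    if lang is not None:
        extra["lang"] = lang
    if keywords:
        extra["keywords"] = keywords
    # ...and only coerce an empty dict to None at the last minute.
    return extra or None
```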
*	typing: initial annotations on importers	Bryan Newbold	2021-11-03	1	-10/+17
\|	This commit just adds the type annotations, doesn't do fixes to code to make type checking pass.
*	importers: remove unused __main__ routine	Bryan Newbold	2021-11-03	1	-5/+0
\|	These perhaps were used in initial development or testing? fatcat_import.py is the correct way to do these imports, even for testing/development.
*	fmt (black): fatcat_tools/	Bryan Newbold	2021-11-02	1	-158/+197
\|
*	lint: simple, safe inline lint fixes	Bryan Newbold	2021-11-02	1	-1/+1
\|	'==' vs 'is'; 'not a in b' vs 'a not in b'; etc
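For readers unfamiliar with these lint rules, a minimal sketch of the two patterns named above follows. It assumes the '==' vs 'is' item refers to comparisons against singletons such as None; both function names are hypothetical.

```python
def is_missing(value):
    # Identity check against the None singleton,
    # rather than the equality form `value == None`.
    return value is None


def lacks_key(key, mapping):
    # `key not in mapping` reads as a single operator,
    # rather than the equivalent but clumsier `not key in mapping`.
    return key not in mapping
```

Both rewrites preserve behavior for ordinary values, which is why they count as safe inline fixes.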
*	lint/fmt: remove all 'import *'	Bryan Newbold	2021-11-02	1	-5/+7
\|
*	python: partial importer utilization of new schema changes	Bryan Newbold	2021-10-13	1	-3/+9
\|
*	fix issnl typo in pubmed	Bryan Newbold	2020-07-23	1	-1/+1
\|	Oh no! This bug may actually have had significant negative impact on metadata in fatcat, in terms of missing container_id associations with pubmed entities. There are about 500k release entities with a PMID but no container_id. Of those, 89k have at least a container_name. Unclear how many would have matched to ISSN-L and thus to a container.
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	1	-4/+2
\|
*	importers: clarify handling of ApiException	Bryan Newbold	2020-05-22	1	-0/+1
\|	One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
*	pubmed: use untranslated title if translated not available	Bryan Newbold	2020-04-01	1	-0/+6
\|	The primary motivation for this change is that fatcat requires a non-empty title for each release entity. Pubmed/Medline occasionally indexes just a VernacularTitle with no ArticleTitle for foreign publications, and currently those records don't end up in fatcat at all.
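The fallback described here could be sketched as below. This is an assumption-laden illustration: it uses the standard-library ElementTree parser to stay self-contained (not necessarily what the importer uses), and `pick_title` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def pick_title(article: ET.Element) -> Optional[str]:
    """Hypothetical helper: prefer ArticleTitle, fall back to VernacularTitle."""
    for tag in ("ArticleTitle", "VernacularTitle"):
        elem = article.find(tag)
        if elem is not None:
            # itertext() also collects text nested inside child tags.
            text = "".join(elem.itertext()).strip()
            if text:
                return text
    # Neither element carried any text; the caller must skip the record.
    return None
```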
*	importers: replace newlines in get_text() strings	Bryan Newbold	2020-04-01	1	-5/+7
\|
*	pubmed: bunch of .get_text() instead of .string	Bryan Newbold	2020-03-28	1	-12/+12
\|	Yikes! Apparently when a tag has child tags, .string will return None instead of all the strings. .get_text() returns all of it:
\|	https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
\|	https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
\|	I've left things like identifiers as .string, where we expect only a single string inside.
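The difference between the two accessors can be reproduced with a small snippet. The markup below is invented for illustration, and the built-in "html.parser" backend is used only so the sketch needs no extra parser; that backend lowercases tag names, hence the lowercase tags.

```python
from bs4 import BeautifulSoup

# Invented sample: a title containing a child tag, plus a plain identifier.
markup = (
    "<article>"
    "<articletitle>Effects of <i>E. coli</i> on growth</articletitle>"
    "<pmid>12345</pmid>"
    "</article>"
)
soup = BeautifulSoup(markup, "html.parser")

title_tag = soup.find("articletitle")
pmid_tag = soup.find("pmid")

# .string is None here because the tag has several children
# (text, an <i> tag, more text) rather than a single string.
title_via_string = title_tag.string

# .get_text() concatenates all nested strings.
title_via_get_text = title_tag.get_text()

# A tag holding exactly one string still works with .string.
pmid = pmid_tag.string
```

This is why identifiers, which are expected to hold a single string, can reasonably stay on `.string`, while titles and abstracts need `.get_text()`.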
*	pubmed: handle multiple ReferenceList	Bryan Newbold	2020-03-20	1	-1/+4
\|	This resolves a situation noticed in prod where we were only importing/updating a single reference per article. Includes a regression test.
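A rough sketch of the idea follows: visit every ReferenceList element instead of only the first one found. It uses the standard-library ElementTree parser to stay self-contained, which is a substitution and not necessarily what the importer uses; the sample XML and the name `collect_citations` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample with two ReferenceList elements under one parent.
SAMPLE = """
<PubmedData>
  <ReferenceList>
    <Reference><Citation>First citation</Citation></Reference>
  </ReferenceList>
  <ReferenceList>
    <Reference><Citation>Second citation</Citation></Reference>
    <Reference><Citation>Third citation</Citation></Reference>
  </ReferenceList>
</PubmedData>
"""


def collect_citations(pubmed_data):
    citations = []
    # findall() visits every ReferenceList; find() would stop at the first.
    for ref_list in pubmed_data.findall("ReferenceList"):
        for ref in ref_list.findall("Reference"):
            citation = ref.findtext("Citation")
            if citation:
                citations.append(citation)
    return citations


citations = collect_citations(ET.fromstring(SAMPLE.strip()))
```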
*	pubmed: update many more metadata fields	Bryan Newbold	2020-03-19	1	-0/+22
\|	In particular, with daily updates in most cases the DOI will be registered first, then the entity updated with PMID when that is available. Often the pubmed metadata will be more complete, with abstracts etc, and we'll want those improvements.
*	importers: control update behavior with more-standard flag	Bryan Newbold	2020-01-06	1	-0/+4
\|
*	pubmed: if doing update, also do subtitle schema update	Bryan Newbold	2019-12-23	1	-1/+9
\|
*	pubmed: improve warning and stderr formatting	Bryan Newbold	2019-12-23	1	-5/+6
\|
*	pubmed: use standard identifier cleaners	Bryan Newbold	2019-12-23	1	-17/+14
\|
*	pubmed: remove unused extid mapping code	Bryan Newbold	2019-12-23	1	-29/+0
\|
*	pubmed: do reference lookups by default	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	pubmed: null doi parsing check	Bryan Newbold	2019-12-23	1	-1/+1
\|
*	add basic MedlineDate year parsing	Bryan Newbold	2019-12-23	1	-0/+11
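MedlineDate values are free-form strings such as "1998 Dec-1999 Jan" or "2000 Spring". A basic year parse might look like the sketch below. The function name and the regular expression are assumptions made for illustration, not the importer's actual code.

```python
import re
from typing import Optional


def medline_date_year(medline_date: str) -> Optional[int]:
    """Hypothetical helper: pull a leading four-digit year out of a MedlineDate."""
    # Accept optional leading whitespace, then exactly four digits
    # followed by a word boundary.
    match = re.match(r"\s*(\d{4})\b", medline_date)
    return int(match.group(1)) if match else None
```

For a date range, this sketch returns the first year and ignores the rest of the string.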
\|
*	refactor all python source for client lib name	Bryan Newbold	2019-09-05	1	-16/+16
\|
*	more pubmed importer fixes	Bryan Newbold	2019-06-03	1	-6/+13
\|
*	yet another pubmed weird DOI corner case	Bryan Newbold	2019-05-29	1	-1/+1
\|
*	handle pubmed CollectiveName null-ness	Bryan Newbold	2019-05-29	1	-1/+1
\|
*	handle empty retraction_of.PMID in pubmed importer	Bryan Newbold	2019-05-29	1	-2/+4
\|
*	more MARC languages, and less verbose reporting	Bryan Newbold	2019-05-24	1	-1/+1
\|
*	pubmed DOIs need strip()	Bryan Newbold	2019-05-22	1	-1/+1
\|
*	pubmed: try to work around multi-edits	Bryan Newbold	2019-05-22	1	-3/+13
\|
*	more strict pubmed DOI handling	Bryan Newbold	2019-05-22	1	-1/+3
\|
*	more pubmed checks; handle PMID/DOI mismatch differently	Bryan Newbold	2019-05-22	1	-2/+7
\|
*	all new importers need to set contrib index (order)	Bryan Newbold	2019-05-22	1	-0/+4
\|
*	pubmed importer command and tweaks	Bryan Newbold	2019-05-22	1	-9/+227
\|
*	importers: create containers by default	Bryan Newbold	2019-05-21	1	-1/+2
\|
*	updates to pubmed importer	Bryan Newbold	2019-05-21	1	-32/+60
\|
*	fix lint issue in pubmed importer	Bryan Newbold	2019-05-21	1	-1/+1
\|
*	tweaks to new imports/tests	Bryan Newbold	2019-05-21	1	-6/+4
\|
*	initial pubmed importer	Bryan Newbold	2019-05-21	1	-0/+512