fatcat - [no description]

	Commit message	Author	Age	Files	Lines
*	fix typo in fileset comparison helper	Bryan Newbold	2022-03-23	1	-1/+1
\|
*	ingest fileset fixes, and some test coverage	Bryan Newbold	2022-03-23	1	-0/+11
\|
*	codespell fixes in python code (comments)	Bryan Newbold	2021-11-24	1	-2/+2
\|
*	Merge branch 'bnewbold-import-refactors' into 'master'	bnewbold	2021-11-11	1	-65/+4
\|\	import refactors and deprecations
\| \|	Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes. The Datacite-specific stuff could use review here.
\| \|	Remove unused/deprecated/dead code:
\| \|	- cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
\| \|	- "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)
\| \|	Refactors:
\| \|	- moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
\| \|	- shuffled around relative imports and some function names ("clean_str" vs. "clean")
\| \|	Some actual behavioral changes:
\| \|	- remove some Datacite-specific license slugs
\| \|	- stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
\| \|	- remove some excess metadata from datacite 'extra' fields
\| *	refactor importer metadata tables into separate file; move some helpers around	Bryan Newbold	2021-11-10	1	-59/+2
\| \|	- MAX_ABSTRACT_LENGTH set in a single place (importer common)
\| \|	- merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
\| *	importers: refactor imports of clean() and other normalization helpers	Bryan Newbold	2021-11-10	1	-4/+1
\| \|
\| *	importers: use clean_doi() in many more (all?) importers	Bryan Newbold	2021-11-09	1	-3/+2
\| \|
* \|	imports: generic file cleanup removes exact duplicate URLs	Bryan Newbold	2021-11-09	1	-0/+9
\|/
*	typing: initial annotations on importers	Bryan Newbold	2021-11-03	1	-47/+99
\|	This commit just adds the type annotations, doesn't do fixes to code to make type checking pass.
*	re-fix some lint issues after big 'fmt'	Bryan Newbold	2021-11-02	1	-2/+2
\|
*	fmt (black): fatcat_tools/	Bryan Newbold	2021-11-02	1	-92/+106
\|
*	python: isort everything	Bryan Newbold	2021-11-02	1	-12/+12
\|
*	small python tweaks for annotations, imports	Bryan Newbold	2021-11-02	1	-1/+1
\|
*	try some type annotations	Bryan Newbold	2021-11-02	1	-33/+34
\|
*	generic fileset importer class, with test coverage	Bryan Newbold	2021-10-14	1	-0/+4
\|
*	kafka import: optional 'force-flush' mode for some importers	Bryan Newbold	2021-10-01	1	-0/+13
\|	Behavior and motivation described in the kafka json import comment.
*	importer common: more verbose logging (with counts)	Bryan Newbold	2021-10-01	1	-4/+4
\|
*	small python lint fixes (no behavior change)	Bryan Newbold	2021-05-25	1	-2/+0
\|
*	fuzzy: set 120 second timeout on ES lookups	Bryan Newbold	2020-12-23	1	-1/+1
\|
*	add 'lxml' mode for large XML file import, and multi-tags	Bryan Newbold	2020-12-17	1	-15/+28
\|
*	update fuzzy helper to pass 'reason' through to import code	Bryan Newbold	2020-12-17	1	-3/+3
\|	The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases.
*	add fuzzy matching helper to importer base class	Bryan Newbold	2020-12-16	1	-2/+62
\|	Using fuzzycat. Add basic test coverage.
*	more python normalizers, and move from importer common	Bryan Newbold	2020-11-19	1	-154/+4
\|	Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal. Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library).
\|	Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports.
*	remove spurious print statement	Bryan Newbold	2020-09-03	1	-1/+0
\|
*	generic file entity clean-ups as part of file_meta importer	Bryan Newbold	2020-09-02	1	-0/+47
\|
*	simple lint (flake8) fixes over python codebase	Bryan Newbold	2020-07-23	1	-1/+0
\|	These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements.
*	lint (flake8) tool python files	Bryan Newbold	2020-07-01	1	-13/+13
\|
*	importers: clarify handling of ApiException	Bryan Newbold	2020-05-22	1	-4/+8
\|	One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative. The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
*	consistently use raw string prefix for regex	Bryan Newbold	2020-04-17	1	-1/+1
\|
*	Merge pull request #53 from EdwardBetts/spelling	bnewbold	2020-03-27	1	-1/+1
\|\	Correct spelling mistakes
\| *	Correct spelling mistakes	Edward Betts	2020-03-27	1	-1/+1
\| \|
* \|	Merge branch 'martin-kafka-bs4-import' into 'master'	Martin Czygan	2020-03-10	1	-0/+65
\|\	pubmed and arxiv harvest preparations
\| \|	See merge request webgroup/fatcat!28
\| *	common: use smaller batch size since XML parsing may be slow	Martin Czygan	2020-03-10	1	-1/+1
\| \|	Address kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate
\| \|	> consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame.
\| *	pubmed ftp harvest and KafkaBs4XmlPusher	Martin Czygan	2020-02-19	1	-0/+65
\| \|	* add PubmedFTPWorker
\| \|	* utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
\| \|	* add KafkaBs4XmlPusher
* \|	add some more domain/rel URL mappings	Bryan Newbold	2020-02-22	1	-0/+9
\|/
*	fix KafkaError worker reporting for partition errors	Bryan Newbold	2020-01-29	1	-1/+1
\|
*	importers: control update behavior with more-standard flag	Bryan Newbold	2020-01-06	1	-0/+1
\|
*	write diagnostic messages to stderr	Martin Czygan	2019-12-16	1	-2/+2
\|	During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
*	Merge branch 'martin-importers-common-doc-fix' into 'master'	Martin Czygan	2019-12-14	1	-13/+10
\|\	Update EntityImporter docstring.
\| \|	See merge request webgroup/fatcat!9
\| *	complete parse_record docstring	Martin Czygan	2019-12-14	1	-0/+6
\| \|
\| *	Update EntityImporter docstring.	Martin Czygan	2019-12-13	1	-13/+4
\| \|	I believe the required method is `parse_record`, not `parse`.
* \|	revert accidentally committed test timing	Bryan Newbold	2019-12-13	1	-2/+2
\| \|	Also fix a spurious typo.
* \|	ensure importer description arg isn't clobbered	Bryan Newbold	2019-12-12	1	-1/+3
\| \|
* \|	flush importer editgroups every few minutes	Bryan Newbold	2019-12-12	1	-5/+20
\| \|
* \|	EntityImporter: submit (not accept) mode	Bryan Newbold	2019-12-12	1	-2/+14
\|/	For use with bots that don't have admin privileges, or where human follow-up review is desired.
*	crude support for 'sandcrawler' kafka topics	Bryan Newbold	2019-11-15	1	-2/+3
\|
*	refactor duplicated b32_hex function in importers	Bryan Newbold	2019-10-08	1	-0/+9
\|
*	review/fix all confluent-kafka produce code	Bryan Newbold	2019-09-20	1	-1/+0
\|
*	small fixes to confluent-kafka importers/workers	Bryan Newbold	2019-09-20	1	-10/+24
\|	- decrease default changelog pipeline to 5.0sec
\|	- fix missing KafkaException harvester imports
\|	- more confluent-kafka tweaks
\|	- updates to kafka consumer configs
\|	- bump elastic updates consumergroup (again)
*	small kafka tweaks for robustness	Bryan Newbold	2019-09-20	1	-0/+3
\|