path: root/python/fatcat_tools/importers/common.py
Commit message (Author, Age, Files, Lines)
* fix typo in fileset comparison helper (Bryan Newbold, 2022-03-23, 1 file, -1/+1)
|
* ingest fileset fixes, and some test coverage (Bryan Newbold, 2022-03-23, 1 file, -0/+11)
|
* codespell fixes in python code (comments) (Bryan Newbold, 2021-11-24, 1 file, -2/+2)
|
* Merge branch 'bnewbold-import-refactors' into 'master' (bnewbold, 2021-11-11, 1 file, -65/+4)
|\
| |   import refactors and deprecations
| |
| |   Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes.
| |
| |   The Datacite-specific stuff could use review here.
| |
| |   Remove unused/deprecated/dead code:
| |
| |   - cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
| |   - "extid map" sqlite3 feature from several importers, was only used for initial bulk imports (and maybe should not have been used)
| |
| |   Refactors:
| |
| |   - moved a number of large datastructures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all, just the ones that were either generic or very large (making it hard to read code)
| |   - shuffled around relative imports and some function names ("clean_str" vs. "clean")
| |
| |   Some actual behavioral changes:
| |
| |   - remove some Datacite-specific license slugs
| |   - stop trying to fix double-slashes in DOIs, that was causing more harm than help (some DOIs do actually have double-slashes!)
| |   - remove some excess metadata from datacite 'extra' fields
| |
| * refactor importer metadata tables into separate file; move some helpers around (Bryan Newbold, 2021-11-10, 1 file, -59/+2)
| |
| |   - MAX_ABSTRACT_LENGTH set in a single place (importer common)
| |   - merge datacite license slug table into common table, removing some TDM-specific licenses (which do not apply in the context of preserving the full work)
| |
| * importers: refactor imports of clean() and other normalization helpers (Bryan Newbold, 2021-11-10, 1 file, -4/+1)
| |
| * importers: use clean_doi() in many more (all?) importers (Bryan Newbold, 2021-11-09, 1 file, -3/+2)
| |
* | imports: generic file cleanup removes exact duplicate URLs (Bryan Newbold, 2021-11-09, 1 file, -0/+9)
|/
* typing: initial annotations on importers (Bryan Newbold, 2021-11-03, 1 file, -47/+99)
|
|   This commit just adds the type annotations, doesn't do fixes to code to make type checking pass.
|
* re-fix some lint issues after big 'fmt' (Bryan Newbold, 2021-11-02, 1 file, -2/+2)
|
* fmt (black): fatcat_tools/ (Bryan Newbold, 2021-11-02, 1 file, -92/+106)
|
* python: isort everything (Bryan Newbold, 2021-11-02, 1 file, -12/+12)
|
* small python tweaks for annotations, imports (Bryan Newbold, 2021-11-02, 1 file, -1/+1)
|
* try some type annotations (Bryan Newbold, 2021-11-02, 1 file, -33/+34)
|
* generic fileset importer class, with test coverage (Bryan Newbold, 2021-10-14, 1 file, -0/+4)
|
* kafka import: optional 'force-flush' mode for some importers (Bryan Newbold, 2021-10-01, 1 file, -0/+13)
|
|   Behavior and motivation described in the kafka json import comment.
|
* importer common: more verbose logging (with counts) (Bryan Newbold, 2021-10-01, 1 file, -4/+4)
|
* small python lint fixes (no behavior change) (Bryan Newbold, 2021-05-25, 1 file, -2/+0)
|
* fuzzy: set 120 second timeout on ES lookups (Bryan Newbold, 2020-12-23, 1 file, -1/+1)
|
* add 'lxml' mode for large XML file import, and multi-tags (Bryan Newbold, 2020-12-17, 1 file, -15/+28)
|
* update fuzzy helper to pass 'reason' through to import code (Bryan Newbold, 2020-12-17, 1 file, -3/+3)
|
|   The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases.
|
* add fuzzy matching helper to importer base class (Bryan Newbold, 2020-12-16, 1 file, -2/+62)
|
|   Using fuzzycat. Add basic test coverage.
|
* more python normalizers, and move from importer common (Bryan Newbold, 2020-11-19, 1 file, -154/+4)
|
|   Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal.
|
|   Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library).
|
|   Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged. Current changes are backwards compatible via re-imports.
|
* remove spurious print statement (Bryan Newbold, 2020-09-03, 1 file, -1/+0)
|
* generic file entity clean-ups as part of file_meta importer (Bryan Newbold, 2020-09-02, 1 file, -0/+47)
|
* simple lint (flake8) fixes over python codebase (Bryan Newbold, 2020-07-23, 1 file, -1/+0)
|
|   These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements.
|
* lint (flake8) tool python files (Bryan Newbold, 2020-07-01, 1 file, -13/+13)
|
* importers: clarify handling of ApiException (Bryan Newbold, 2020-05-22, 1 file, -4/+8)
|
|   One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative.
|
|   The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
|
* consistently use raw string prefix for regex (Bryan Newbold, 2020-04-17, 1 file, -1/+1)
|
* Merge pull request #53 from EdwardBetts/spelling (bnewbold, 2020-03-27, 1 file, -1/+1)
|\
| |   Correct spelling mistakes
| |
| * Correct spelling mistakes (Edward Betts, 2020-03-27, 1 file, -1/+1)
| |
* | Merge branch 'martin-kafka-bs4-import' into 'master' (Martin Czygan, 2020-03-10, 1 file, -0/+65)
|\ \
| |/
|/|
| |   pubmed and arxiv harvest preparations
| |
| |   See merge request webgroup/fatcat!28
| |
| * common: use smaller batch size since XML parsing may be slow (Martin Czygan, 2020-03-10, 1 file, -1/+1)
| |
| |   Address kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate
| |
| |   > consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame.
| |
| * pubmed ftp harvest and KafkaBs4XmlPusher (Martin Czygan, 2020-02-19, 1 file, -0/+65)
| |
| |   * add PubmedFTPWorker
| |   * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
| |   * add KafkaBs4XmlPusher
| |
* | add some more domain/rel URL mappings (Bryan Newbold, 2020-02-22, 1 file, -0/+9)
|/
* fix KafkaError worker reporting for partition errors (Bryan Newbold, 2020-01-29, 1 file, -1/+1)
|
* importers: control update behavior with more-standard flag (Bryan Newbold, 2020-01-06, 1 file, -0/+1)
|
* write diagnostic messages to stderr (Martin Czygan, 2019-12-16, 1 file, -2/+2)
|
|   During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
|
* Merge branch 'martin-importers-common-doc-fix' into 'master' (Martin Czygan, 2019-12-14, 1 file, -13/+10)
|\
| |   Update EntityImporter docstring.
| |
| |   See merge request webgroup/fatcat!9
| |
| * complete parse_record docstring (Martin Czygan, 2019-12-14, 1 file, -0/+6)
| |
| * Update EntityImporter docstring. (Martin Czygan, 2019-12-13, 1 file, -13/+4)
| |
| |   I believe the required method is `parse_record`, not `parse`.
| |
* | revert accidentally committed test timing (Bryan Newbold, 2019-12-13, 1 file, -2/+2)
| |
| |   Also fix a spurious typo.
| |
* | ensure importer description arg isn't clobbered (Bryan Newbold, 2019-12-12, 1 file, -1/+3)
| |
* | flush importer editgroups every few minutes (Bryan Newbold, 2019-12-12, 1 file, -5/+20)
| |
* | EntityImporter: submit (not accept) mode (Bryan Newbold, 2019-12-12, 1 file, -2/+14)
|/
|   For use with bots that don't have admin privileges, or where human follow-up review is desired.
|
* crude support for 'sandcrawler' kafka topics (Bryan Newbold, 2019-11-15, 1 file, -2/+3)
|
* refactor duplicated b32_hex function in importers (Bryan Newbold, 2019-10-08, 1 file, -0/+9)
|
* review/fix all confluent-kafka produce code (Bryan Newbold, 2019-09-20, 1 file, -1/+0)
|
* small fixes to confluent-kafka importers/workers (Bryan Newbold, 2019-09-20, 1 file, -10/+24)
|
|   - decrease default changelog pipeline to 5.0sec
|   - fix missing KafkaException harvester imports
|   - more confluent-kafka tweaks
|   - updates to kafka consumer configs
|   - bump elastic updates consumergroup (again)
|
* small kafka tweaks for robustness (Bryan Newbold, 2019-09-20, 1 file, -0/+3)
|