| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
| |
|
| |
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
import refactors and deprecations
Some of these are from old stale branches (the datacite subject metadata patch), but most are from yesterday and today. Sort of a hodge-podge, but the general theme is getting around to deferred cleanups and refactors specific to importer code before making some behavioral changes.
The Datacite-specific stuff could use review here.
Remove unused/deprecated/dead code:
- cdl_dash_dat and wayback_static importers, which were for specific early example entities and have been superseded by other importers
- "extid map" sqlite3 feature from several importers, which was only used for initial bulk imports (and maybe should not have been used)
Refactors:
- moved a number of large data structures out of importer code and into a dedicated static file (`biblio_lookup_tables.py`). Didn't move all of them, just the ones that were either generic or very large (making the code hard to read)
- shuffled around relative imports and some function names ("clean_str" vs. "clean")
Some actual behavioral changes:
- remove some Datacite-specific license slugs
- stop trying to fix double-slashes in DOIs; that was doing more harm than good (some DOIs do actually have double-slashes!)
- remove some excess metadata from datacite 'extra' fields
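
As an illustration of the double-slash change above, here is a minimal sketch of a DOI cleanup helper that strips common prefixes and validates the overall shape without rewriting `//` inside the DOI. The name `clean_doi_sketch`, the prefix list, and the regular expression are assumptions made for this example; they are not taken from the importer code.

```python
import re
from typing import Optional

# Assumed, simplified shape check: "10.", a registrant code, a slash, then a
# non-empty suffix without whitespace. Real-world DOI validation may differ.
DOI_PATTERN = re.compile(r"^10\.\d{3,6}/\S+$")

# Hypothetical list of prefixes to strip before validation.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def clean_doi_sketch(raw: str) -> Optional[str]:
    """Normalize a DOI string, leaving any double slashes untouched."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Deliberately no doi.replace("//", "/") here: some DOIs contain "//".
    if not DOI_PATTERN.match(doi):
        return None
    return doi
```

With this sketch, an input such as `10.1234//abc` is returned unchanged rather than being collapsed to a different (and possibly non-existent) identifier.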
|
| |
| |
| |
| |
| |
| |
| | |
- MAX_ABSTRACT_LENGTH set in a single place (importer common)
- merge datacite license slug table into common table, removing some
TDM-specific licenses (which do not apply in the context of preserving
the full work)
|
| | |
|
| | |
|
|/ |
|
|
|
|
|
| |
This commit just adds the type annotations; it doesn't fix the code to
make type checking pass.
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
Behavior and motivation described in the kafka json import comment.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
The motivation for this change is to enable passing the 'reason' through
to edit extra metadata, in cases where we merge or cluster releases.
|
|
|
|
| |
Using fuzzycat. Add basic test coverage.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Moved several normalizer helpers out of fatcat_tools.importers.common to
fatcat_tools.normal.
Copied language name and country name parser helpers from chocula
repository (built on existing pycountry helper library).
Have not gone through and refactored other importers to point to these
helpers yet; that should be a separate PR when this branch is merged.
Current changes are backwards compatible via re-imports.
|
| |
|
| |
|
|
|
|
|
|
| |
These should not have any behavior changes, though a number of exception
catches are now more general, and there may be long-tail exceptions
getting thrown in these statements.
|
| |
|
|
|
|
|
|
|
|
| |
One of these (in ingest importer pipeline) is an actual bug, the others
are just changing the syntax to be more explicit/conservative.
The ingest importer bug seems to have resulted in some bad file match
imports; scale of impact is unknown.
|
| |
|
|\
| |
| | |
Correct spelling mistakes
|
| | |
|
|\ \
| |/
|/|
| |
| | |
pubmed and arxiv harvest preparations
See merge request webgroup/fatcat!28
|
| |
| |
| |
| |
| |
| |
| |
| | |
Address the Kafka tradeoff between long and short time-outs. Shorter
time-outs would facilitate
> consumer group re-balances and other consumer group state changes
[...] in a reasonable human time-frame.
|
| |
| |
| |
| |
| |
| |
| | |
* add PubmedFTPWorker
* utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream)
but may live elsewhere, as they are more generic
* add KafkaBs4XmlPusher
|
|/ |
|
| |
|
| |
|
|
|
|
|
| |
During debugging, it can be helpful to keep stdout (e.g. processing
results) and diagnostic messages separate.
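
A minimal sketch of that separation, assuming results are emitted as JSON lines on stdout and progress or debug messages go to stderr. The helper names `emit_result` and `emit_diagnostic` are hypothetical and only serve this example.

```python
import json
import sys


def emit_result(record: dict) -> None:
    """Write one processing result to stdout as a JSON line."""
    print(json.dumps(record, sort_keys=True), file=sys.stdout)


def emit_diagnostic(message: str) -> None:
    """Write a diagnostic message to stderr, keeping stdout clean."""
    print(message, file=sys.stderr)
```

With this split, redirecting stdout to a file (for example `script.py > results.json`) captures only the results, while diagnostics still show up on the terminal.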
|
|\
| |
| |
| |
| | |
Update EntityImporter docstring.
See merge request webgroup/fatcat!9
|
| | |
|
| |
| |
| |
| | |
I believe the required method is `parse_record`, not `parse`.
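
To show why the method name in the docstring matters, here is a small sketch using a hypothetical base class. `SketchImporter`, `push_record`, and both subclasses are stand-ins invented for this example and do not reproduce the real `EntityImporter` interface; the only detail taken from the message above is that subclasses are expected to provide `parse_record`.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class SketchImporter(ABC):
    """Hypothetical stand-in for an importer base class."""

    def __init__(self) -> None:
        self.parsed: List[Dict[str, Any]] = []

    @abstractmethod
    def parse_record(self, raw: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        """Turn one raw record into an entity dict, or None to skip it."""

    def push_record(self, raw: Dict[str, Any]) -> None:
        # The base class drives the pipeline and calls parse_record by name.
        entity = self.parse_record(raw)
        if entity is not None:
            self.parsed.append(entity)


class TitleImporter(SketchImporter):
    def parse_record(self, raw: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        title = (raw.get("title") or "").strip()
        return {"title": title} if title else None


class MisnamedImporter(SketchImporter):
    # Implements `parse` instead of `parse_record`, so the class stays
    # abstract and cannot be instantiated.
    def parse(self, raw: Dict[str, Any]) -> Optional[Dict[str, Any]]:
        return raw
```

In this sketch a subclass that follows an outdated docstring and defines `parse` fails with a `TypeError` at instantiation time, which is the kind of confusion a corrected docstring avoids.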
|
| |
| |
| |
| | |
Also fix a spurious typo.
|
| | |
|
| | |
|
|/
|
|
|
| |
For use with bots that don't have admin privileges, or where human
follow-up review is desired.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
- decrease default changelog pipeline to 5.0sec
- fix missing KafkaException harvester imports
- more confluent-kafka tweaks
- updates to kafka consumer configs
- bump elastic updates consumergroup (again)
|
| |
|