path: root/python/fatcat_tools/importers/common.py
Commit message [Author, Age, Files, Lines]
* kafka import: optional 'force-flush' mode for some importers [Bryan Newbold, 2021-10-01, 1 file, -0/+13]
|   Behavior and motivation described in the kafka json import comment.
* importer common: more verbose logging (with counts) [Bryan Newbold, 2021-10-01, 1 file, -4/+4]
|
* small python lint fixes (no behavior change) [Bryan Newbold, 2021-05-25, 1 file, -2/+0]
|
* fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, 1 file, -1/+1]
|
* add 'lxml' mode for large XML file import, and multi-tags [Bryan Newbold, 2020-12-17, 1 file, -15/+28]
|
* update fuzzy helper to pass 'reason' through to import code [Bryan Newbold, 2020-12-17, 1 file, -3/+3]
|   The motivation for this change is to enable passing the 'reason' through to edit extra metadata, in cases where we merge or cluster releases.
* add fuzzy matching helper to importer base class [Bryan Newbold, 2020-12-16, 1 file, -2/+62]
|   Using fuzzycat. Add basic test coverage.
* more python normalizers, and move from importer common [Bryan Newbold, 2020-11-19, 1 file, -154/+4]
|   Moved several normalizer helpers out of fatcat_tools.importers.common to fatcat_tools.normal.
|
|   Copied language name and country name parser helpers from chocula repository (built on existing pycountry helper library).
|
|   Have not gone through and refactored other importers to point to these helpers yet; that should be a separate PR when this branch is merged.
|
|   Current changes are backwards compatible via re-imports.
* remove spurious print statement [Bryan Newbold, 2020-09-03, 1 file, -1/+0]
|
* generic file entity clean-ups as part of file_meta importer [Bryan Newbold, 2020-09-02, 1 file, -0/+47]
|
* simple lint (flake8) fixes over python codebase [Bryan Newbold, 2020-07-23, 1 file, -1/+0]
|   These should not have any behavior changes, though a number of exception catches are now more general, and there may be long-tail exceptions getting thrown in these statements.
* lint (flake8) tool python files [Bryan Newbold, 2020-07-01, 1 file, -13/+13]
|
* importers: clarify handling of ApiException [Bryan Newbold, 2020-05-22, 1 file, -4/+8]
|   One of these (in ingest importer pipeline) is an actual bug, the others are just changing the syntax to be more explicit/conservative.
|
|   The ingest importer bug seems to have resulted in some bad file match imports; scale of impact is unknown.
* consistently use raw string prefix for regex [Bryan Newbold, 2020-04-17, 1 file, -1/+1]
|
* Merge pull request #53 from EdwardBetts/spelling [bnewbold, 2020-03-27, 1 file, -1/+1]
|\
| |   Correct spelling mistakes
| * Correct spelling mistakes [Edward Betts, 2020-03-27, 1 file, -1/+1]
| |
* | Merge branch 'martin-kafka-bs4-import' into 'master' [Martin Czygan, 2020-03-10, 1 file, -0/+65]
|\ \
| |/
|/|
| |   pubmed and arxiv harvest preparations
| |
| |   See merge request webgroup/fatcat!28
| * common: use smaller batch size since XML parsing may be slow [Martin Czygan, 2020-03-10, 1 file, -1/+1]
| |   Address kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate
| |
| |   > consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame.
| * pubmed ftp harvest and KafkaBs4XmlPusher [Martin Czygan, 2020-02-19, 1 file, -0/+65]
| |   * add PubmedFTPWorker
| |   * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
| |   * add KafkaBs4XmlPusher
* | add some more domain/rel URL mappings [Bryan Newbold, 2020-02-22, 1 file, -0/+9]
|/
* fix KafkaError worker reporting for partition errors [Bryan Newbold, 2020-01-29, 1 file, -1/+1]
|
* importers: control update behavior with more-standard flag [Bryan Newbold, 2020-01-06, 1 file, -0/+1]
|
* write diagnostic messages to stderr [Martin Czygan, 2019-12-16, 1 file, -2/+2]
|   During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
* Merge branch 'martin-importers-common-doc-fix' into 'master' [Martin Czygan, 2019-12-14, 1 file, -13/+10]
|\
| |   Update EntityImporter docstring.
| |
| |   See merge request webgroup/fatcat!9
| * complete parse_record docstring [Martin Czygan, 2019-12-14, 1 file, -0/+6]
| |
| * Update EntityImporter docstring. [Martin Czygan, 2019-12-13, 1 file, -13/+4]
| |   I believe the required method is `parse_record`, not `parse`.
* | revert accidentally committed test timing [Bryan Newbold, 2019-12-13, 1 file, -2/+2]
| |   Also fix a spurious typo.
* | ensure importer description arg isn't clobbered [Bryan Newbold, 2019-12-12, 1 file, -1/+3]
| |
* | flush importer editgroups every few minutes [Bryan Newbold, 2019-12-12, 1 file, -5/+20]
| |
* | EntityImporter: submit (not accept) mode [Bryan Newbold, 2019-12-12, 1 file, -2/+14]
|/
|   For use with bots that don't have admin privileges, or where human follow-up review is desired.
* crude support for 'sandcrawler' kafka topics [Bryan Newbold, 2019-11-15, 1 file, -2/+3]
|
* refactor duplicated b32_hex function in importers [Bryan Newbold, 2019-10-08, 1 file, -0/+9]
|
* review/fix all confluent-kafka produce code [Bryan Newbold, 2019-09-20, 1 file, -1/+0]
|
* small fixes to confluent-kafka importers/workers [Bryan Newbold, 2019-09-20, 1 file, -10/+24]
|   - decrease default changelog pipeline to 5.0sec
|   - fix missing KafkaException harvester imports
|   - more confluent-kafka tweaks
|   - updates to kafka consumer configs
|   - bump elastic updates consumergroup (again)
* small kafka tweaks for robustness [Bryan Newbold, 2019-09-20, 1 file, -0/+3]
|
* convert importers to confluent-kafka library [Bryan Newbold, 2019-09-20, 1 file, -19/+71]
|
* refactor all python source for client lib name [Bryan Newbold, 2019-09-05, 1 file, -3/+3]
|
* fix Importer editgroup_extra pass-through [Bryan Newbold, 2019-09-05, 1 file, -2/+1]
|
* file rel: social -> academicsocial [Bryan Newbold, 2019-09-03, 1 file, -2/+2]
|
* better importer 'total' counting [Bryan Newbold, 2019-09-03, 1 file, -4/+2]
|
* make importer extid lookups faster by hiding [Bryan Newbold, 2019-05-29, 1 file, -2/+2]
|
* is_cjk() handles kanji better [Bryan Newbold, 2019-05-29, 1 file, -4/+6]
|
* faster LargeFile XML importer for PubMed [Bryan Newbold, 2019-05-29, 1 file, -0/+50]
|
* more MARC languages, and less verbose reporting [Bryan Newbold, 2019-05-24, 1 file, -3/+14]
|
* missing MARC/ISO languages [Bryan Newbold, 2019-05-22, 1 file, -0/+2]
|
* Gaelic! [Bryan Newbold, 2019-05-22, 1 file, -0/+3]
|
* creative importer for bulk JSTOR imports [Bryan Newbold, 2019-05-22, 1 file, -0/+22]
|
* bs4 XML parse cleanup [Bryan Newbold, 2019-05-22, 1 file, -0/+2]
|
* JALC bulk file importer [Bryan Newbold, 2019-05-21, 1 file, -0/+20]
|
* updates to pubmed importer [Bryan Newbold, 2019-05-21, 1 file, -1/+20]
|