path: root/python/fatcat_tools/importers/common.py
Commit message  Author  Age  Files  Lines
* Merge branch 'martin-kafka-bs4-import' into 'master'  Martin Czygan  2020-03-10  1  -0/+65
|\    pubmed and arxiv harvest preparations. See merge request webgroup/fatcat!28
| * common: use smaller batch size since XML parsing may be slow  Martin Czygan  2020-03-10  1  -1/+1
| |     Address the Kafka tradeoff between long and short time-outs. Shorter time-outs would facilitate "consumer group re-balances and other consumer group state changes [...] in a reasonable human time-frame."
| * pubmed ftp harvest and KafkaBs4XmlPusher  Martin Czygan  2020-02-19  1  -0/+65
| |     * add PubmedFTPWorker
| |     * utils are currently stored alongside pubmed (e.g. ftpretr, xmlstream) but may live elsewhere, as they are more generic
| |     * add KafkaBs4XmlPusher
* | add some more domain/rel URL mappings  Bryan Newbold  2020-02-22  1  -0/+9
|/
* fix KafkaError worker reporting for partition errors  Bryan Newbold  2020-01-29  1  -1/+1
|
* importers: control update behavior with more-standard flag  Bryan Newbold  2020-01-06  1  -0/+1
|
* write diagnostic messages to stderr  Martin Czygan  2019-12-16  1  -2/+2
|     During debugging, it can be helpful to keep stdout (e.g. processing results) and diagnostic messages separate.
* Merge branch 'martin-importers-common-doc-fix' into 'master'  Martin Czygan  2019-12-14  1  -13/+10
|\    Update EntityImporter docstring. See merge request webgroup/fatcat!9
| * complete parse_record docstring  Martin Czygan  2019-12-14  1  -0/+6
| |
| * Update EntityImporter docstring.  Martin Czygan  2019-12-13  1  -13/+4
| |     I believe the required method is `parse_record`, not `parse`.
* | revert accidentally committed test timing  Bryan Newbold  2019-12-13  1  -2/+2
| |     Also fix a spurious typo.
* | ensure importer description arg isn't clobbered  Bryan Newbold  2019-12-12  1  -1/+3
| |
* | flush importer editgroups every few minutes  Bryan Newbold  2019-12-12  1  -5/+20
| |
* | EntityImporter: submit (not accept) mode  Bryan Newbold  2019-12-12  1  -2/+14
|/    For use with bots that don't have admin privileges, or where human follow-up review is desired.
* crude support for 'sandcrawler' kafka topics  Bryan Newbold  2019-11-15  1  -2/+3
|
* refactor duplicated b32_hex function in importers  Bryan Newbold  2019-10-08  1  -0/+9
|
* review/fix all confluent-kafka produce code  Bryan Newbold  2019-09-20  1  -1/+0
|
* small fixes to confluent-kafka importers/workers  Bryan Newbold  2019-09-20  1  -10/+24
|     - decrease default changelog pipeline to 5.0sec
|     - fix missing KafkaException harvester imports
|     - more confluent-kafka tweaks
|     - updates to kafka consumer configs
|     - bump elastic updates consumergroup (again)
* small kafka tweaks for robustness  Bryan Newbold  2019-09-20  1  -0/+3
|
* convert importers to confluent-kafka library  Bryan Newbold  2019-09-20  1  -19/+71
|
* refactor all python source for client lib name  Bryan Newbold  2019-09-05  1  -3/+3
|
* fix Importer editgroup_extra pass-through  Bryan Newbold  2019-09-05  1  -2/+1
|
* file rel: social -> academicsocial  Bryan Newbold  2019-09-03  1  -2/+2
|
* better importer 'total' counting  Bryan Newbold  2019-09-03  1  -4/+2
|
* make importer extid lookups faster by hiding  Bryan Newbold  2019-05-29  1  -2/+2
|
* is_cjk() handles kanji better  Bryan Newbold  2019-05-29  1  -4/+6
|
* faster LargeFile XML importer for PubMed  Bryan Newbold  2019-05-29  1  -0/+50
|
* more MARC languages, and less verbose reporting  Bryan Newbold  2019-05-24  1  -3/+14
|
* missing MARC/ISO languages  Bryan Newbold  2019-05-22  1  -0/+2
|
* Gaelic!  Bryan Newbold  2019-05-22  1  -0/+3
|
* creative importer for bulk JSTOR imports  Bryan Newbold  2019-05-22  1  -0/+22
|
* bs4 XML parse cleanup  Bryan Newbold  2019-05-22  1  -0/+2
|
* JALC bulk file importer  Bryan Newbold  2019-05-21  1  -0/+20
|
* updates to pubmed importer  Bryan Newbold  2019-05-21  1  -1/+20
|
* tweaks to new imports/tests  Bryan Newbold  2019-05-21  1  -5/+78
|
* initial flesh out of JALC parser  Bryan Newbold  2019-05-21  1  -0/+36
|
* python impl  Bryan Newbold  2019-05-14  1  -3/+3
|
* add limits to match importers  Bryan Newbold  2019-04-23  1  -0/+3
|
* archive.org isn't really a repository  Bryan Newbold  2019-04-22  1  -1/+3
|
* mechanism to not double-update entities  Bryan Newbold  2019-04-18  1  -0/+3
|
* update URL rel list  Bryan Newbold  2019-04-18  1  -1/+10
|
* add SqlitePusher importer option  Bryan Newbold  2019-04-12  1  -0/+20
|
* bunch of lint/whitespace cleanups  Bryan Newbold  2019-02-22  1  -2/+2
|
* fix bug in clean() resulting in many consistency check fails  Bryan Newbold  2019-01-29  1  -2/+3
|
* add stub parse_record() to make pylint happy  Bryan Newbold  2019-01-28  1  -0/+4
|
* don't allow empty or single-character clean strings  Bryan Newbold  2019-01-28  1  -1/+1
|
* transform and import fixes/tweaks  Bryan Newbold  2019-01-25  1  -2/+2
|
* refactor _get_editgroup => get_editgroup_id  Bryan Newbold  2019-01-24  1  -4/+5
|
* refactor make_rel_url  Bryan Newbold  2019-01-24  1  -0/+60
|
* clean() checks if it returns null-length string  Bryan Newbold  2019-01-23  1  -1/+5
|