path: root/python/fatcat_tools/importers/common.py
Commit message (Author, Age, Files, Lines)
* Merge branch 'martin-kafka-bs4-import' into 'master' (Martin Czygan, 2020-03-10, 1, -0/+65)
|\
| * common: use smaller batch size since XML parsing may be slow (Martin Czygan, 2020-03-10, 1, -1/+1)
| * pubmed ftp harvest and KafkaBs4XmlPusher (Martin Czygan, 2020-02-19, 1, -0/+65)
* | add some more domain/rel URL mappings (Bryan Newbold, 2020-02-22, 1, -0/+9)
|/
* fix KafkaError worker reporting for partition errors (Bryan Newbold, 2020-01-29, 1, -1/+1)
* importers: control update behavior with more-standard flag (Bryan Newbold, 2020-01-06, 1, -0/+1)
* write diagnostic messages to stderr (Martin Czygan, 2019-12-16, 1, -2/+2)
* Merge branch 'martin-importers-common-doc-fix' into 'master' (Martin Czygan, 2019-12-14, 1, -13/+10)
|\
| * complete parse_record docstring (Martin Czygan, 2019-12-14, 1, -0/+6)
| * Update EntityImporter docstring. (Martin Czygan, 2019-12-13, 1, -13/+4)
* | revert accidentally commited test timing (Bryan Newbold, 2019-12-13, 1, -2/+2)
* | ensure importer description arg isn't clobbered (Bryan Newbold, 2019-12-12, 1, -1/+3)
* | flush importer editgroups every few minutes (Bryan Newbold, 2019-12-12, 1, -5/+20)
* | EntityImporter: submit (not accept) mode (Bryan Newbold, 2019-12-12, 1, -2/+14)
|/
* crude support for 'sandcrawler' kafka topics (Bryan Newbold, 2019-11-15, 1, -2/+3)
* refactor duplicated b32_hex function in importers (Bryan Newbold, 2019-10-08, 1, -0/+9)
* review/fix all confluent-kafka produce code (Bryan Newbold, 2019-09-20, 1, -1/+0)
* small fixes to confluent-kafka importers/workers (Bryan Newbold, 2019-09-20, 1, -10/+24)
* small kafka tweaks for robustness (Bryan Newbold, 2019-09-20, 1, -0/+3)
* convert importers to confluent-kafka library (Bryan Newbold, 2019-09-20, 1, -19/+71)
* refactor all python source for client lib name (Bryan Newbold, 2019-09-05, 1, -3/+3)
* fix Importer editgroup_extra pass-through (Bryan Newbold, 2019-09-05, 1, -2/+1)
* file rel: social -> academicsocial (Bryan Newbold, 2019-09-03, 1, -2/+2)
* better importer 'total' counting (Bryan Newbold, 2019-09-03, 1, -4/+2)
* make importer extid lookups faster by hiding (Bryan Newbold, 2019-05-29, 1, -2/+2)
* is_cjk() handles kanji better (Bryan Newbold, 2019-05-29, 1, -4/+6)
* faster LargeFile XML importer for PubMed (Bryan Newbold, 2019-05-29, 1, -0/+50)
* more MARC languages, and less verbose reporting (Bryan Newbold, 2019-05-24, 1, -3/+14)
* missing MARC/ISO languages (Bryan Newbold, 2019-05-22, 1, -0/+2)
* Gaelic! (Bryan Newbold, 2019-05-22, 1, -0/+3)
* creative importer for bulk JSTOR imports (Bryan Newbold, 2019-05-22, 1, -0/+22)
* bs4 XML parse cleanup (Bryan Newbold, 2019-05-22, 1, -0/+2)
* JALC bulk file importer (Bryan Newbold, 2019-05-21, 1, -0/+20)
* updates to pubmed importer (Bryan Newbold, 2019-05-21, 1, -1/+20)
* tweaks to new imports/tests (Bryan Newbold, 2019-05-21, 1, -5/+78)
* initial flesh out of JALC parser (Bryan Newbold, 2019-05-21, 1, -0/+36)
* python impl (Bryan Newbold, 2019-05-14, 1, -3/+3)
* add limits to match importers (Bryan Newbold, 2019-04-23, 1, -0/+3)
* archive.org isn't really a repository (Bryan Newbold, 2019-04-22, 1, -1/+3)
* mechanism to not double-update entities (Bryan Newbold, 2019-04-18, 1, -0/+3)
* update URL rel list (Bryan Newbold, 2019-04-18, 1, -1/+10)
* add SqlitePusher importer option (Bryan Newbold, 2019-04-12, 1, -0/+20)
* bunch of lint/whitespace cleanups (Bryan Newbold, 2019-02-22, 1, -2/+2)
* fix bug in clean() resulting in many consistency check fails (Bryan Newbold, 2019-01-29, 1, -2/+3)
* add stub parse_record() to make pylint happy (Bryan Newbold, 2019-01-28, 1, -0/+4)
* don't allow empty or single-character clean strings (Bryan Newbold, 2019-01-28, 1, -1/+1)
* transform and import fixes/tweaks (Bryan Newbold, 2019-01-25, 1, -2/+2)
* refactor _get_editgroup => get_editgroup_id (Bryan Newbold, 2019-01-24, 1, -4/+5)
* refactor make_rel_url (Bryan Newbold, 2019-01-24, 1, -0/+60)
* clean() checks if it returns null-length string (Bryan Newbold, 2019-01-23, 1, -1/+5)