path: root/python/fatcat_tools/importers/common.py
Commit message [Author, Age, Files, Lines]
* fix typo in fileset comparison helper [Bryan Newbold, 2022-03-23, 1, -1/+1]
* ingest fileset fixes, and some test coverage [Bryan Newbold, 2022-03-23, 1, -0/+11]
* codespell fixes in python code (comments) [Bryan Newbold, 2021-11-24, 1, -2/+2]
* Merge branch 'bnewbold-import-refactors' into 'master' [bnewbold, 2021-11-11, 1, -65/+4]
|\
| * refactor importer metadata tables into separate file; move some helpers around [Bryan Newbold, 2021-11-10, 1, -59/+2]
| * importers: refactor imports of clean() and other normalization helpers [Bryan Newbold, 2021-11-10, 1, -4/+1]
| * importers: use clean_doi() in many more (all?) importers [Bryan Newbold, 2021-11-09, 1, -3/+2]
* | imports: generic file cleanup removes exact duplicate URLs [Bryan Newbold, 2021-11-09, 1, -0/+9]
|/
* typing: initial annotations on importers [Bryan Newbold, 2021-11-03, 1, -47/+99]
* re-fix some lint issues after big 'fmt' [Bryan Newbold, 2021-11-02, 1, -2/+2]
* fmt (black): fatcat_tools/ [Bryan Newbold, 2021-11-02, 1, -92/+106]
* python: isort everything [Bryan Newbold, 2021-11-02, 1, -12/+12]
* small python tweaks for annotations, imports [Bryan Newbold, 2021-11-02, 1, -1/+1]
* try some type annotations [Bryan Newbold, 2021-11-02, 1, -33/+34]
* generic fileset importer class, with test coverage [Bryan Newbold, 2021-10-14, 1, -0/+4]
* kafka import: optional 'force-flush' mode for some importers [Bryan Newbold, 2021-10-01, 1, -0/+13]
* importer common: more verbose logging (with counts) [Bryan Newbold, 2021-10-01, 1, -4/+4]
* small python lint fixes (no behavior change) [Bryan Newbold, 2021-05-25, 1, -2/+0]
* fuzzy: set 120 second timeout on ES lookups [Bryan Newbold, 2020-12-23, 1, -1/+1]
* add 'lxml' mode for large XML file import, and multi-tags [Bryan Newbold, 2020-12-17, 1, -15/+28]
* update fuzzy helper to pass 'reason' through to import code [Bryan Newbold, 2020-12-17, 1, -3/+3]
* add fuzzy matching helper to importer base class [Bryan Newbold, 2020-12-16, 1, -2/+62]
* more python normalizers, and move from importer common [Bryan Newbold, 2020-11-19, 1, -154/+4]
* remove spurious print statement [Bryan Newbold, 2020-09-03, 1, -1/+0]
* generic file entity clean-ups as part of file_meta importer [Bryan Newbold, 2020-09-02, 1, -0/+47]
* simple lint (flake8) fixes over python codebase [Bryan Newbold, 2020-07-23, 1, -1/+0]
* lint (flake8) tool python files [Bryan Newbold, 2020-07-01, 1, -13/+13]
* importers: clarify handling of ApiException [Bryan Newbold, 2020-05-22, 1, -4/+8]
* consistently use raw string prefix for regex [Bryan Newbold, 2020-04-17, 1, -1/+1]
* Merge pull request #53 from EdwardBetts/spelling [bnewbold, 2020-03-27, 1, -1/+1]
|\
| * Correct spelling mistakes [Edward Betts, 2020-03-27, 1, -1/+1]
* | Merge branch 'martin-kafka-bs4-import' into 'master' [Martin Czygan, 2020-03-10, 1, -0/+65]
|\ \
| |/
|/|
| * common: use smaller batch size since XML parsing may be slow [Martin Czygan, 2020-03-10, 1, -1/+1]
| * pubmed ftp harvest and KafkaBs4XmlPusher [Martin Czygan, 2020-02-19, 1, -0/+65]
* | add some more domain/rel URL mappings [Bryan Newbold, 2020-02-22, 1, -0/+9]
|/
* fix KafkaError worker reporting for partition errors [Bryan Newbold, 2020-01-29, 1, -1/+1]
* importers: control update behavior with more-standard flag [Bryan Newbold, 2020-01-06, 1, -0/+1]
* write diagnostic messages to stderr [Martin Czygan, 2019-12-16, 1, -2/+2]
* Merge branch 'martin-importers-common-doc-fix' into 'master' [Martin Czygan, 2019-12-14, 1, -13/+10]
|\
| * complete parse_record docstring [Martin Czygan, 2019-12-14, 1, -0/+6]
| * Update EntityImporter docstring. [Martin Czygan, 2019-12-13, 1, -13/+4]
* | revert accidentally commited test timing [Bryan Newbold, 2019-12-13, 1, -2/+2]
* | ensure importer description arg isn't clobbered [Bryan Newbold, 2019-12-12, 1, -1/+3]
* | flush importer editgroups every few minutes [Bryan Newbold, 2019-12-12, 1, -5/+20]
* | EntityImporter: submit (not accept) mode [Bryan Newbold, 2019-12-12, 1, -2/+14]
|/
* crude support for 'sandcrawler' kafka topics [Bryan Newbold, 2019-11-15, 1, -2/+3]
* refactor duplicated b32_hex function in importers [Bryan Newbold, 2019-10-08, 1, -0/+9]
* review/fix all confluent-kafka produce code [Bryan Newbold, 2019-09-20, 1, -1/+0]
* small fixes to confluent-kafka importers/workers [Bryan Newbold, 2019-09-20, 1, -10/+24]
* small kafka tweaks for robustness [Bryan Newbold, 2019-09-20, 1, -0/+3]