path: root/python/sandcrawler_worker.py
Commit message | Author | Age | Files | Lines
* pdftrio basic python code | Bryan Newbold | 2020-02-12 | 1 | -0/+19
    This is basically just a copy/paste of GROBID code, only simpler!
* sandcrawler_worker: ingest worker distinct consumer groups | Bryan Newbold | 2020-01-29 | 1 | -1/+3
    I'm in the process of resetting these consumer groups, so might as well take the opportunity to split by topic and use the new canonical naming format.
* make grobid-extract worker batch size 1 | Bryan Newbold | 2020-01-28 | 1 | -0/+1
    This is part of attempts to fix Kafka errors that look like they might be timeouts.
* improve sentry reporting with 'release' git hash | Bryan Newbold | 2020-01-15 | 1 | -1/+5
* bulk ingest file request topic support | Bryan Newbold | 2020-01-14 | 1 | -1/+7
* grobid-to-kafka support in ingest worker | Bryan Newbold | 2020-01-14 | 1 | -0/+6
* update persist worker invocation to use batches | Bryan Newbold | 2020-01-02 | 1 | -15/+55
* fix sandcrawler persist workers | Bryan Newbold | 2020-01-02 | 1 | -8/+36
* start work on persist workers and tool | Bryan Newbold | 2020-01-02 | 1 | -5/+15
* refactor: improve argparse usage | Bryan Newbold | 2019-12-18 | 1 | -4/+8
    use ArgumentDefaultsHelpFormatter and add help messages to all sub-commands
* update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -1/+1
    Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but multi-processing or some other concurrent dispatch/processing.
* fix lint errors | Bryan Newbold | 2019-11-13 | 1 | -5/+10
* correct ingest-file consumer group | Bryan Newbold | 2019-11-13 | 1 | -1/+1
* add basic sandcrawler worker (kafka) | Bryan Newbold | 2019-11-13 | 1 | -0/+74