path: root/python/sandcrawler/workers.py
Commit message | Author | Date | Files | Lines
* have JsonLinePusher continue on JSON decode errors (but count) | Bryan Newbold | 2020-01-02 | 1 | -1/+5
* refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -20/+22
    Should use logging soon, but this seems more idiomatic in the meanwhile.
* CI: make some jobs manual | Bryan Newbold | 2019-11-15 | 1 | -0/+2
    Scalding test is broken :( But we aren't even using that code much these days.
* bump kafka max poll interval for consumers | Bryan Newbold | 2019-11-14 | 1 | -2/+2
    The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit.
* update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -3/+3
    Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing.
* refactor consume_topic name out of make_kafka_consumer() | Bryan Newbold | 2019-11-13 | 1 | -5/+5
    Best to do this in wrapping code for full flexibility.
* workers: better generic batch-size arg handling | Bryan Newbold | 2019-10-03 | 1 | -0/+6
* more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 1 | -0/+6
* off-by-one error in batch sizes | Bryan Newbold | 2019-09-26 | 1 | -1/+1
* lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -0/+419
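
The most recent entry in this log has JsonLinePusher continue past JSON decode errors while counting them, and the entry before it sends diagnostics to stderr with print(..., file=sys.stderr). The sketch below illustrates that general pattern only; it is not the code in workers.py. The function name `push_json_lines`, the `handle_record` callback, and the counter keys are all hypothetical names chosen for illustration.

```python
import json
import sys
from collections import Counter


def push_json_lines(lines, handle_record):
    """Feed each JSON line to handle_record, skipping lines that fail to parse.

    Hypothetical helper: the name, signature, and counter keys are
    assumptions made for this sketch, not taken from workers.py.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            # Blank lines are ignored and not counted.
            continue
        counts["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Count the bad line and keep going instead of aborting the run.
            counts["error-json-decode"] += 1
            continue
        handle_record(record)
        counts["pushed"] += 1
    # Report on stderr so stdout stays free for any data output.
    print("JSON lines pushed: {}".format(dict(counts)), file=sys.stderr)
    return counts
```

A caller could pass an open file handle as `lines` and any callable as `handle_record`; the returned counter then shows how many lines were skipped, so a handful of corrupt lines in a large dump is visible without stopping the whole job.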