path: root/python/sandcrawler/workers.py
Commit message | Author | Date | Files | Lines (-/+)
* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 1 file, -2/+9)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 1 file, -1/+15)

  The goal here is to have SPNv2 requests back off when we get
  back-pressure (usually caused by some sessions taking too long). Lack of
  proper back-pressure handling is making it hard to turn up parallelism.

  This is a hack because we still time out and drop the slow request. A
  better approach is probably to run a background thread while the
  KafkaPusher thread does the polling, maybe with timeouts to detect slow
  processing (greater than 30 seconds?) and only pause/resume in that
  case. This would also make taking batches easier. Unlike the existing
  code, however, the parallelism needs to happen at the Pusher level so
  that the polling (Kafka) and the "await" (for all worker threads to
  complete) are handled correctly.
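The pause/resume idea sketched in this commit message can be illustrated with a toy pusher. This is a hypothetical sketch, not the actual sandcrawler code: `BackoffPusher`, `SLOW_THRESHOLD_SEC`, and the stub consumer interface are all invented for illustration; the real KafkaPusher's API may differ. The point is only the decision logic: while a request has been running longer than the threshold, keep the consumer paused so the broker does not evict it from the group; resume once the request completes.

```python
import time

# Threshold from the commit message's "greater than 30 seconds?" suggestion.
SLOW_THRESHOLD_SEC = 30.0

class BackoffPusher:
    """Toy stand-in for the KafkaPusher described above.

    `consumer` is anything exposing pause() and resume(); with
    confluent_kafka these would take a list of TopicPartition objects.
    """

    def __init__(self, consumer):
        self.consumer = consumer
        self.paused = False

    def tick(self, request_started_at, now=None):
        """Call from the poll loop; pause while the in-flight request is slow.

        Returns True if the consumer is currently paused.
        """
        now = time.monotonic() if now is None else now
        slow = (now - request_started_at) > SLOW_THRESHOLD_SEC
        if slow and not self.paused:
            self.consumer.pause()
            self.paused = True
        elif not slow and self.paused:
            self.consumer.resume()
            self.paused = False
        return self.paused
```

Driving `tick()` from the same loop that calls `poll()` keeps heartbeats flowing to the broker even while a slow SPNv2 request is outstanding, which is exactly the back-pressure behavior the commit is reaching for.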
* worker kafka setting tweaks (Bryan Newbold, 2020-01-28; 1 file, -2/+4)

  These are all attempts to get kafka workers operating more smoothly.
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* have JsonLinePusher continue on JSON decode errors (but count) (Bryan Newbold, 2020-01-02; 1 file, -1/+5)
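The behavior this commit describes, skipping malformed JSON lines while counting them instead of crashing, can be sketched as follows. The function name `push_json_lines` and the counter keys are assumptions for illustration; the real JsonLinePusher's interface may differ.

```python
import json
from collections import Counter

def push_json_lines(lines, worker, counts=None):
    """Feed JSON-decoded records to `worker`, one per line.

    Malformed lines are counted and skipped rather than raising,
    mirroring the continue-on-decode-error behavior described above.
    """
    counts = counts if counts is not None else Counter()
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts['error-json-decode'] += 1
            continue
        worker(record)
        counts['pushed'] += 1
    return counts
```

Counting the errors (rather than silently dropping them) keeps bad input visible in worker stats without taking the whole pusher down on one corrupt line.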
* refactor: use print(..., file=sys.stderr) (Bryan Newbold, 2019-12-18; 1 file, -20/+22)

  Should use logging soon, but this seems more idiomatic in the meanwhile.
* CI: make some jobs manual (Bryan Newbold, 2019-11-15; 1 file, -0/+2)

  Scalding test is broken :( But we aren't even using that code much
  these days.
* bump kafka max poll interval for consumers (Bryan Newbold, 2019-11-14; 1 file, -2/+2)

  The ingest worker keeps timing out at just over 5 minutes, so bump it
  just a bit.
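The "just over 5 minutes" timeout lines up with Kafka's default `max.poll.interval.ms` of 300000 ms (5 minutes): if a consumer goes longer than that between `poll()` calls, the broker kicks it out of the group and triggers a rebalance. A hypothetical config sketch of the kind of bump this commit makes (confluent_kafka-style keys; the broker address, group name, and exact values here are placeholders, not the actual values in workers.py):

```python
# Hypothetical consumer config sketch; values are illustrative placeholders.
consumer_conf = {
    'bootstrap.servers': 'localhost:9092',   # placeholder broker address
    'group.id': 'ingest-file-workers',       # placeholder group name
    # Must exceed the longest expected gap between poll() calls, or the
    # broker evicts the consumer and rebalances the group. Default is
    # 300000 ms (5 min); ingest work can run just past that.
    'max.poll.interval.ms': 6 * 60 * 1000,   # bumped to 6 minutes
    'session.timeout.ms': 30 * 1000,
}
```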
* update ingest-file batch size to 1 (Bryan Newbold, 2019-11-14; 1 file, -3/+3)

  Was defaulting to 100, which I think was causing lots of consumer group
  timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch
  back to batches of 10 or so, but with multi-processing or some other
  concurrent dispatch/processing.
* refactor consume_topic name out of make_kafka_consumer() (Bryan Newbold, 2019-11-13; 1 file, -5/+5)

  Best to do this in wrapping code for full flexibility.
* workers: better generic batch-size arg handling (Bryan Newbold, 2019-10-03; 1 file, -0/+6)
* more counts and bugfixes in grobid_tool (Bryan Newbold, 2019-09-26; 1 file, -0/+6)
* off-by-one error in batch sizes (Bryan Newbold, 2019-09-26; 1 file, -1/+1)
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -0/+419)