path: root/python/sandcrawler/workers.py
Commit message | Author | Date | Files | Lines (-/+)
* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 1 file, -2/+9)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 1 file, -1/+15)

  The goal here is to have SPNv2 requests back off when we get
  back-pressure (usually caused by some sessions taking too long). Lack of
  proper back-pressure handling is making it hard to turn up parallelism.

  This is a hack because we still time out and drop the slow request. A
  better approach is probably to run a background thread while the
  KafkaPusher thread does the polling, maybe with timeouts to detect slow
  processing (greater than 30 seconds?) and only pause/resume in that
  case. This would also make taking batches easier. Unlike the existing
  code, however, the parallelism needs to happen at the Pusher level so
  that the polling (Kafka) and the "await" (for all worker threads to
  complete) are handled correctly.
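The pause/resume idea sketched in this commit message can be illustrated with a toy pusher. This is a hypothetical sketch, not the actual sandcrawler code: `BackoffPusher`, `SLOW_THRESHOLD_SEC`, and the stub consumer interface are all invented for illustration; the real KafkaPusher's API may differ. The point is only the decision logic: while a request has been running longer than the threshold, keep the consumer paused so the broker does not evict it from the group; resume once the request completes.

```python
import time

# Threshold from the commit message's "greater than 30 seconds?" suggestion.
SLOW_THRESHOLD_SEC = 30.0

class BackoffPusher:
    """Toy stand-in for the KafkaPusher described above.

    `consumer` is anything exposing pause() and resume(); with
    confluent_kafka these would take a list of TopicPartition objects.
    """

    def __init__(self, consumer):
        self.consumer = consumer
        self.paused = False

    def tick(self, request_started_at, now=None):
        """Call from the poll loop; pause while the in-flight request is slow.

        Returns True if the consumer is currently paused.
        """
        now = time.monotonic() if now is None else now
        slow = (now - request_started_at) > SLOW_THRESHOLD_SEC
        if slow and not self.paused:
            self.consumer.pause()
            self.paused = True
        elif not slow and self.paused:
            self.consumer.resume()
            self.paused = False
        return self.paused
```

Driving `tick()` from the same loop that calls `poll()` keeps heartbeats flowing to the broker even while a slow SPNv2 request is outstanding, which is exactly the back-pressure behavior the commit is reaching for.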
* worker kafka setting tweaks (Bryan Newbold, 2020-01-28; 1 file, -2/+4)

  These are all attempts to get kafka workers operating more smoothly.
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28; 1 file, -1/+1)
* have JsonLinePusher continue on JSON decode errors (but count) (Bryan Newbold, 2020-01-02; 1 file, -1/+5)
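The behavior this commit describes, skipping malformed JSON lines while counting them instead of crashing, can be sketched as follows. The function name `push_json_lines` and the counter keys are assumptions for illustration; the real JsonLinePusher's interface may differ.

```python
import json
from collections import Counter

def push_json_lines(lines, worker, counts=None):
    """Feed JSON-decoded records to `worker`, one per line.

    Malformed lines are counted and skipped rather than raising,
    mirroring the continue-on-decode-error behavior described above.
    """
    counts = counts if counts is not None else Counter()
    for line in lines:
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts['error-json-decode'] += 1
            continue
        worker(record)
        counts['pushed'] += 1
    return counts
```

Counting the errors (rather than silently dropping them) keeps bad input visible in worker stats without taking the whole pusher down on one corrupt line.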
* refactor: use print(..., file=sys.stderr) (Bryan Newbold, 2019-12-18; 1 file, -20/+22)

  Should use logging soon, but this seems more idiomatic in the meanwhile.
* CI: make some jobs manual (Bryan Newbold, 2019-11-15; 1 file, -0/+2)

  Scalding test is broken :( But we aren't even using that code much
  these days.
* bump kafka max poll interval for consumers (Bryan Newbold, 2019-11-14; 1 file, -2/+2)

  The ingest worker keeps timing out at just over 5 minutes, so bump it
  just a bit.
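The "just over 5 minutes" timeout lines up with Kafka's default `max.poll.interval.ms` of 300000 ms (5 minutes): if a consumer goes longer than that between `poll()` calls, the broker kicks it out of the group and triggers a rebalance. A hypothetical config sketch of the kind of bump this commit makes (confluent_kafka-style keys; the broker address, group name, and exact values here are placeholders, not the actual values in workers.py):

```python
# Hypothetical consumer config sketch; values are illustrative placeholders.
consumer_conf = {
    'bootstrap.servers': 'localhost:9092',   # placeholder broker address
    'group.id': 'ingest-file-workers',       # placeholder group name
    # Must exceed the longest expected gap between poll() calls, or the
    # broker evicts the consumer and rebalances the group. Default is
    # 300000 ms (5 min); ingest work can run just past that.
    'max.poll.interval.ms': 6 * 60 * 1000,   # bumped to 6 minutes
    'session.timeout.ms': 30 * 1000,
}
```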
* update ingest-file batch size to 1 (Bryan Newbold, 2019-11-14; 1 file, -3/+3)

  Was defaulting to 100, which I think was causing lots of consumer group
  timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch
  back to batches of 10 or so, but with multi-processing or some other
  concurrent dispatch/processing.
* refactor consume_topic name out of make_kafka_consumer() (Bryan Newbold, 2019-11-13; 1 file, -5/+5)

  Best to do this in wrapping code for full flexibility.
* workers: better generic batch-size arg handling (Bryan Newbold, 2019-10-03; 1 file, -0/+6)
* more counts and bugfixes in grobid_tool (Bryan Newbold, 2019-09-26; 1 file, -0/+6)
* off-by-one error in batch sizes (Bryan Newbold, 2019-09-26; 1 file, -1/+1)
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file, -0/+419)