path: root/python/sandcrawler/workers.py
Commit log (message, author, date, files changed, -deleted/+added):
* Revert "reimplement worker timeout with multiprocessing" (Bryan Newbold, 2020-10-22; 1 file changed, -17/+23)
  This reverts commit 031f51752e79dbdde47bbc95fe6b3600c9ec711a. It didn't
  actually work in testing: the Kafka Producer object (and probably other
  objects) can't be pickled.
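The pickling failure behind this revert can be illustrated with a small sketch. `FetchWorker` and the lock attribute are hypothetical stand-ins: the real worker held a confluent-kafka Producer, which, like a thread lock, cannot be serialized, and multiprocessing has to pickle the worker to ship it to a child process:

```python
import pickle
import threading

class FetchWorker:
    """Hypothetical stand-in for a worker holding an unpicklable handle
    (the real case was a confluent-kafka Producer object)."""
    def __init__(self):
        self.producer = threading.Lock()  # stand-in for the Kafka Producer

def can_pickle(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

# multiprocessing needs to pickle the worker to send it to a child process,
# so a timeout-via-multiprocessing wrapper fails on workers like this:
print(can_pickle(FetchWorker()))  # False
print(can_pickle({"plain": "data"}))  # True
```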
* reimplement worker timeout with multiprocessing (Bryan Newbold, 2020-10-22; 1 file changed, -23/+17)
* differentiate wayback-error from wayback-content-error (Bryan Newbold, 2020-10-21; 1 file changed, -3/+3)
  The motivation here is to distinguish errors caused by the content stored
  in wayback (eg, in WARCs) from operational errors (eg, the wayback machine
  being down, or network failures/disruption).
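A minimal sketch of the distinction, with hypothetical class and function names (the fetch logic and error messages are illustrative, not the repo's actual code):

```python
class WaybackError(Exception):
    """Operational failure: wayback machine down, network disruption."""

class WaybackContentError(Exception):
    """Problem with the archived content itself (eg, a bad WARC record)."""

def fetch_from_wayback(url: str, *, service_up: bool, warc_valid: bool) -> bytes:
    # hypothetical fetch showing where each error type would be raised
    if not service_up:
        raise WaybackError("wayback machine unreachable")
    if not warc_valid:
        raise WaybackContentError("WARC record could not be parsed")
    return b"fetched body"
```

Keeping the two classes separate lets callers retry operational failures while treating content problems as permanent.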
* customize timeout per worker; 120 seconds for pdf-extract (Bryan Newbold, 2020-06-29; 1 file changed, -1/+2)
  This is a stab-in-the-dark attempt to resolve long timeouts with this
  worker in prod.
* handle empty fetched blob (Bryan Newbold, 2020-06-27; 1 file changed, -1/+6)
* CDX KeyError as WaybackError from fetch worker (Bryan Newbold, 2020-06-26; 1 file changed, -1/+1)
* don't nest generic fetch errors under pdf_trio (Bryan Newbold, 2020-06-25; 1 file changed, -12/+6)
  This came from sloppy refactoring (and missing test coverage).
* fixes and tweaks from testing locally (Bryan Newbold, 2020-06-17; 1 file changed, -2/+2)
* workers: refactor to pass key to process() (Bryan Newbold, 2020-06-17; 1 file changed, -7/+15)
* refactor worker fetch code into wrapper class (Bryan Newbold, 2020-06-16; 1 file changed, -1/+88)
* rename KafkaGrobidSink -> KafkaCompressSink (Bryan Newbold, 2020-06-16; 1 file changed, -1/+1)
* workers: add missing want() dataflow path (Bryan Newbold, 2020-04-30; 1 file changed, -0/+9)
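A sketch of what a want()/process() dataflow path can look like; the class and method names here are hypothetical, assuming a base class where records pass through a cheap want() filter before the expensive process() step:

```python
class SandcrawlerWorker:
    """Hypothetical base class: records flow through want() before
    process(), so workers can cheaply skip records they don't handle."""

    def __init__(self):
        self.counts = {"total": 0, "skip": 0, "processed": 0}

    def want(self, record) -> bool:
        # default: accept everything; subclasses override to filter
        return True

    def push_record(self, record):
        self.counts["total"] += 1
        if not self.want(record):
            self.counts["skip"] += 1
            return None
        self.counts["processed"] += 1
        return self.process(record)

    def process(self, record):
        raise NotImplementedError

class PdfOnlyWorker(SandcrawlerWorker):
    """Illustrative subclass that only wants PDF records."""
    def want(self, record) -> bool:
        return record.get("mimetype") == "application/pdf"

    def process(self, record):
        return {"status": "ok", "sha1": record.get("sha1")}
```

Routing everything through push_record() also keeps the skip/processed counters accurate, which matters for the count-reporting commits elsewhere in this log.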
* timeouts: don't push through None error messages (Bryan Newbold, 2020-04-29; 1 file changed, -2/+2)
* worker timeout wrapper, and use for kafka (Bryan Newbold, 2020-04-27; 1 file changed, -2/+40)
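One common way to implement a single-process worker timeout wrapper is SIGALRM; this is a hedged sketch of that technique (Unix, main thread only), not necessarily the exact mechanism this commit used:

```python
import signal

def with_timeout(func, timeout_sec: int, *args):
    """Run func(*args), raising TimeoutError if it exceeds timeout_sec.
    Sketch of a SIGALRM-based wrapper: works only on Unix, and only in
    the main thread, but needs no pickling (unlike multiprocessing)."""
    def handler(signum, frame):
        raise TimeoutError(f"worker exceeded {timeout_sec} seconds")
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout_sec)
    try:
        return func(*args)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Usage would look like `with_timeout(worker.process, 120, record)`; a stuck call is interrupted mid-execution by the alarm signal.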
* batch/multiprocess for ZipfilePusher (Bryan Newbold, 2020-04-16; 1 file changed, -3/+18)
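A hedged sketch of the batch-plus-multiprocess idea: slice the stream of zipfile entries into fixed-size batches and hand each batch to a process pool. Function names and the byte-counting "work" are illustrative, not the repo's actual code:

```python
from multiprocessing import get_context

def process_blob(blob: bytes) -> int:
    # hypothetical per-entry work (the real worker did far more than count bytes)
    return len(blob)

def push_batches(blobs, batch_size=4, nproc=2):
    """Sketch: dispatch fixed-size batches of zipfile entries to a process
    pool instead of handling one entry at a time. Uses the Unix 'fork'
    start method so worker functions are inherited, not re-imported."""
    results = []
    ctx = get_context("fork")
    with ctx.Pool(nproc) as pool:
        for i in range(0, len(blobs), batch_size):
            results.extend(pool.map(process_blob, blobs[i:i + batch_size]))
    return results
```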
* workers: add explicit process to base class (Martin Czygan, 2020-03-12; 1 file changed, -0/+6)
  As per https://docs.python.org/3/library/exceptions.html#NotImplementedError:
  > In user defined base classes, abstract methods should raise this
  > exception when they require derived classes to override the method [...]
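The pattern from the quoted docs, in its minimal form (class names here are illustrative, not the repo's):

```python
class BaseWorker:
    """Minimal sketch of the pattern: the base class declares process()
    and raises NotImplementedError, per the Python docs quoted above."""
    def process(self, record):
        raise NotImplementedError("derived classes must override process()")

class EchoWorker(BaseWorker):
    """Trivial override for illustration."""
    def process(self, record):
        return record
```

Calling process() on an un-overridden base class fails loudly instead of silently doing nothing.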
* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 1 file changed, -2/+9)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 1 file changed, -1/+15)
  The goal here is to have SPNv2 requests back off when we get back-pressure
  (usually caused by some sessions taking too long). Lack of proper
  back-pressure is making it hard to turn up parallelism. This is a hack
  because we still time out and drop the slow request.

  A better way is probably to have a background thread run while the
  KafkaPusher thread does polling, maybe with timeouts to detect slow
  processing (greater than 30 seconds?) and only pause/resume in that case.
  This would also make taking batches easier. Unlike the existing code,
  however, the parallelism needs to happen at the Pusher level to do the
  polling (Kafka) and "await" (for all worker threads to complete)
  correctly.
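The backoff idea described above can be sketched as follows. Everything here is illustrative: the function names, the "backpressure" status marker, and the retry limits are assumptions, not the commit's actual code, but the shape matches the message: sleep and retry on back-pressure, and (the hack) eventually give up and drop the slow request:

```python
import time

def request_with_backoff(do_request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Hypothetical sketch: when the service signals back-pressure, sleep
    with exponentially increasing delay and retry instead of hammering it;
    after max_tries, drop the request (the 'hack' the message describes)."""
    delay = base_delay
    for _ in range(max_tries):
        status, body = do_request()
        if status != "backpressure":  # 'backpressure' marker is illustrative
            return status, body
        sleep(delay)
        delay *= 2  # exponential backoff
    return "dropped", None
```

The `sleep` parameter is injected so the loop can be tested without real delays.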
* worker kafka setting tweaks (Bryan Newbold, 2020-01-28; 1 file changed, -2/+4)
  These are all attempts to get kafka workers operating more smoothly.
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28; 1 file changed, -1/+1)
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28; 1 file changed, -1/+1)
* have JsonLinePusher continue on JSON decode errors (but count) (Bryan Newbold, 2020-01-02; 1 file changed, -1/+5)
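A sketch of the skip-but-count behavior, assuming a pusher loop shaped roughly like this (function name and counter keys are illustrative):

```python
import json

def push_json_lines(lines):
    """Sketch of a JsonLinePusher loop that skips malformed JSON lines but
    counts them, instead of letting one bad line crash the whole run."""
    counts = {"pushed": 0, "bad-json": 0}
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines entirely
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["bad-json"] += 1
            continue  # keep going; the counter preserves visibility
        records.append(record)
        counts["pushed"] += 1
    return records, counts
```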
* refactor: use print(..., file=sys.stderr) (Bryan Newbold, 2019-12-18; 1 file changed, -20/+22)
  Should use logging soon, but this seems more idiomatic in the meantime.
* CI: make some jobs manual (Bryan Newbold, 2019-11-15; 1 file changed, -0/+2)
  Scalding test is broken :( But we aren't even using that code much these
  days.
* bump kafka max poll interval for consumers (Bryan Newbold, 2019-11-14; 1 file changed, -2/+2)
  The ingest worker keeps timing out at just over 5 minutes, so bump it
  just a bit.
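For context, this setting lives in the consumer configuration. The fragment below is illustrative: the broker address, group id, and exact values are assumptions, not what this commit set; `max.poll.interval.ms` is the librdkafka/confluent-kafka property governing how long a consumer may go between polls before the broker evicts it from the group (default 300000 ms, i.e. 5 minutes, which matches the "just over 5 minutes" timeouts described):

```python
# Illustrative confluent-kafka (librdkafka) consumer settings; values here
# are examples, not the ones from this commit.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "ingest-workers",           # hypothetical consumer group
    # allow slow workers more than the 300000 ms (5 min) default between
    # polls before the broker kicks them out of the consumer group:
    "max.poll.interval.ms": 360000,
}
```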
* update ingest-file batch size to 1 (Bryan Newbold, 2019-11-14; 1 file changed, -3/+3)
  Was defaulting to 100, which I think was resulting in lots of consumer
  group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably
  switch back to batches of 10 or so, but with multi-processing or some
  other concurrent dispatch/processing.
* refactor consume_topic name out of make_kafka_consumer() (Bryan Newbold, 2019-11-13; 1 file changed, -5/+5)
  Best to do this in wrapping code for full flexibility.
* workers: better generic batch-size arg handling (Bryan Newbold, 2019-10-03; 1 file changed, -0/+6)
* more counts and bugfixes in grobid_tool (Bryan Newbold, 2019-09-26; 1 file changed, -0/+6)
* off-by-one error in batch sizes (Bryan Newbold, 2019-09-26; 1 file changed, -1/+1)
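Batch accumulation in a pusher loop is a classic spot for an off-by-one. This sketch (hypothetical function, not the repo's code) shows the shape of the bug: flushing on `len(batch) > batch_size` yields batches of batch_size + 1 items, where `>=` is correct:

```python
def batch_records(records, batch_size):
    """Sketch of batch accumulation in a pusher loop. Using '>' instead of
    '>=' in the flush check is the classic off-by-one: it would emit
    batches one item larger than requested."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:  # '>' here would be the off-by-one bug
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```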
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file changed, -0/+419)