path: root/python/sandcrawler/workers.py
Commit log (message, author, date, files changed, -deleted/+added):
* Revert "reimplement worker timeout with multiprocessing" (Bryan Newbold, 2020-10-22; 1 file changed, -17/+23)
  This reverts commit 031f51752e79dbdde47bbc95fe6b3600c9ec711a. It didn't
  actually work in testing: the Kafka Producer object (and probably other
  objects) can't be pickled.
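The pickling failure behind this revert can be illustrated with a small sketch. `FetchWorker` and the lock attribute are hypothetical stand-ins: the real worker held a confluent-kafka Producer, which, like a thread lock, cannot be serialized, and multiprocessing has to pickle the worker to ship it to a child process:

```python
import pickle
import threading

class FetchWorker:
    """Hypothetical stand-in for a worker holding an unpicklable handle
    (the real case was a confluent-kafka Producer object)."""
    def __init__(self):
        self.producer = threading.Lock()  # stand-in for the Kafka Producer

def can_pickle(obj) -> bool:
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

# multiprocessing needs to pickle the worker to send it to a child process,
# so a timeout-via-multiprocessing wrapper fails on workers like this:
print(can_pickle(FetchWorker()))  # False
print(can_pickle({"plain": "data"}))  # True
```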
* reimplement worker timeout with multiprocessing (Bryan Newbold, 2020-10-22; 1 file changed, -23/+17)
* differentiate wayback-error from wayback-content-error (Bryan Newbold, 2020-10-21; 1 file changed, -3/+3)
  The motivation here is to distinguish errors caused by the content stored
  in wayback (eg, in WARCs) from operational errors (eg, the wayback machine
  being down, or network failures/disruption).
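A minimal sketch of the distinction, with hypothetical class and function names (the fetch logic and error messages are illustrative, not the repo's actual code):

```python
class WaybackError(Exception):
    """Operational failure: wayback machine down, network disruption."""

class WaybackContentError(Exception):
    """Problem with the archived content itself (eg, a bad WARC record)."""

def fetch_from_wayback(url: str, *, service_up: bool, warc_valid: bool) -> bytes:
    # hypothetical fetch showing where each error type would be raised
    if not service_up:
        raise WaybackError("wayback machine unreachable")
    if not warc_valid:
        raise WaybackContentError("WARC record could not be parsed")
    return b"fetched body"
```

Keeping the two classes separate lets callers retry operational failures while treating content problems as permanent.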
* customize timeout per worker; 120 seconds for pdf-extract (Bryan Newbold, 2020-06-29; 1 file changed, -1/+2)
  This is a stab-in-the-dark attempt to resolve long timeouts with this
  worker in prod.
* handle empty fetched blob (Bryan Newbold, 2020-06-27; 1 file changed, -1/+6)
* CDX KeyError as WaybackError from fetch worker (Bryan Newbold, 2020-06-26; 1 file changed, -1/+1)
* don't nest generic fetch errors under pdf_trio (Bryan Newbold, 2020-06-25; 1 file changed, -12/+6)
  This came from sloppy refactoring (and missing test coverage).
* fixes and tweaks from testing locally (Bryan Newbold, 2020-06-17; 1 file changed, -2/+2)
* workers: refactor to pass key to process() (Bryan Newbold, 2020-06-17; 1 file changed, -7/+15)
* refactor worker fetch code into wrapper class (Bryan Newbold, 2020-06-16; 1 file changed, -1/+88)
* rename KafkaGrobidSink -> KafkaCompressSink (Bryan Newbold, 2020-06-16; 1 file changed, -1/+1)
* workers: add missing want() dataflow path (Bryan Newbold, 2020-04-30; 1 file changed, -0/+9)
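A sketch of what a want()/process() dataflow path can look like; the class and method names here are hypothetical, assuming a base class where records pass through a cheap want() filter before the expensive process() step:

```python
class SandcrawlerWorker:
    """Hypothetical base class: records flow through want() before
    process(), so workers can cheaply skip records they don't handle."""

    def __init__(self):
        self.counts = {"total": 0, "skip": 0, "processed": 0}

    def want(self, record) -> bool:
        # default: accept everything; subclasses override to filter
        return True

    def push_record(self, record):
        self.counts["total"] += 1
        if not self.want(record):
            self.counts["skip"] += 1
            return None
        self.counts["processed"] += 1
        return self.process(record)

    def process(self, record):
        raise NotImplementedError

class PdfOnlyWorker(SandcrawlerWorker):
    """Illustrative subclass that only wants PDF records."""
    def want(self, record) -> bool:
        return record.get("mimetype") == "application/pdf"

    def process(self, record):
        return {"status": "ok", "sha1": record.get("sha1")}
```

Routing everything through push_record() also keeps the skip/processed counters accurate, which matters for the count-reporting commits elsewhere in this log.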
* timeouts: don't push through None error messages (Bryan Newbold, 2020-04-29; 1 file changed, -2/+2)
* worker timeout wrapper, and use for kafka (Bryan Newbold, 2020-04-27; 1 file changed, -2/+40)
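One common way to implement a single-process worker timeout wrapper is SIGALRM; this is a hedged sketch of that technique (Unix, main thread only), not necessarily the exact mechanism this commit used:

```python
import signal

def with_timeout(func, timeout_sec: int, *args):
    """Run func(*args), raising TimeoutError if it exceeds timeout_sec.
    Sketch of a SIGALRM-based wrapper: works only on Unix, and only in
    the main thread, but needs no pickling (unlike multiprocessing)."""
    def handler(signum, frame):
        raise TimeoutError(f"worker exceeded {timeout_sec} seconds")
    old_handler = signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout_sec)
    try:
        return func(*args)
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

Usage would look like `with_timeout(worker.process, 120, record)`; a stuck call is interrupted mid-execution by the alarm signal.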
* batch/multiprocess for ZipfilePusher (Bryan Newbold, 2020-04-16; 1 file changed, -3/+18)
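A hedged sketch of the batch-plus-multiprocess idea: slice the stream of zipfile entries into fixed-size batches and hand each batch to a process pool. Function names and the byte-counting "work" are illustrative, not the repo's actual code:

```python
from multiprocessing import get_context

def process_blob(blob: bytes) -> int:
    # hypothetical per-entry work (the real worker did far more than count bytes)
    return len(blob)

def push_batches(blobs, batch_size=4, nproc=2):
    """Sketch: dispatch fixed-size batches of zipfile entries to a process
    pool instead of handling one entry at a time. Uses the Unix 'fork'
    start method so worker functions are inherited, not re-imported."""
    results = []
    ctx = get_context("fork")
    with ctx.Pool(nproc) as pool:
        for i in range(0, len(blobs), batch_size):
            results.extend(pool.map(process_blob, blobs[i:i + batch_size]))
    return results
```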
* workers: add explicit process to base class (Martin Czygan, 2020-03-12; 1 file changed, -0/+6)
  As per https://docs.python.org/3/library/exceptions.html#NotImplementedError:
  > In user defined base classes, abstract methods should raise this
  > exception when they require derived classes to override the method [...]
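The pattern from the quoted docs, in its minimal form (class names here are illustrative, not the repo's):

```python
class BaseWorker:
    """Minimal sketch of the pattern: the base class declares process()
    and raises NotImplementedError, per the Python docs quoted above."""
    def process(self, record):
        raise NotImplementedError("derived classes must override process()")

class EchoWorker(BaseWorker):
    """Trivial override for illustration."""
    def process(self, record):
        return record
```

Calling process() on an un-overridden base class fails loudly instead of silently doing nothing.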
* improvements to reliability from prod testing (Bryan Newbold, 2020-02-03; 1 file changed, -2/+9)
* hack-y backoff ingest attempt (Bryan Newbold, 2020-02-03; 1 file changed, -1/+15)
  The goal here is to have SPNv2 requests back off when we get back-pressure
  (usually caused by some sessions taking too long). Lack of proper
  back-pressure is making it hard to turn up parallelism. This is a hack
  because we still time out and drop the slow request.

  A better way is probably to have a background thread run while the
  KafkaPusher thread does polling, maybe with timeouts to detect slow
  processing (greater than 30 seconds?) and only pause/resume in that case.
  This would also make taking batches easier. Unlike the existing code,
  however, the parallelism needs to happen at the Pusher level to do the
  polling (Kafka) and "await" (for all worker threads to complete)
  correctly.
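The backoff idea described above can be sketched as follows. Everything here is illustrative: the function names, the "backpressure" status marker, and the retry limits are assumptions, not the commit's actual code, but the shape matches the message: sleep and retry on back-pressure, and (the hack) eventually give up and drop the slow request:

```python
import time

def request_with_backoff(do_request, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Hypothetical sketch: when the service signals back-pressure, sleep
    with exponentially increasing delay and retry instead of hammering it;
    after max_tries, drop the request (the 'hack' the message describes)."""
    delay = base_delay
    for _ in range(max_tries):
        status, body = do_request()
        if status != "backpressure":  # 'backpressure' marker is illustrative
            return status, body
        sleep(delay)
        delay *= 2  # exponential backoff
    return "dropped", None
```

The `sleep` parameter is injected so the loop can be tested without real delays.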
* worker kafka setting tweaks (Bryan Newbold, 2020-01-28; 1 file changed, -2/+4)
  These are all attempts to get kafka workers operating more smoothly.
* workers: yes, poll is necessary (Bryan Newbold, 2020-01-28; 1 file changed, -1/+1)
* fix kafka worker partition-specific error (Bryan Newbold, 2020-01-28; 1 file changed, -1/+1)
* have JsonLinePusher continue on JSON decode errors (but count) (Bryan Newbold, 2020-01-02; 1 file changed, -1/+5)
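A sketch of the skip-but-count behavior, assuming a pusher loop shaped roughly like this (function name and counter keys are illustrative):

```python
import json

def push_json_lines(lines):
    """Sketch of a JsonLinePusher loop that skips malformed JSON lines but
    counts them, instead of letting one bad line crash the whole run."""
    counts = {"pushed": 0, "bad-json": 0}
    records = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines entirely
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            counts["bad-json"] += 1
            continue  # keep going; the counter preserves visibility
        records.append(record)
        counts["pushed"] += 1
    return records, counts
```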
* refactor: use print(..., file=sys.stderr) (Bryan Newbold, 2019-12-18; 1 file changed, -20/+22)
  Should use logging soon, but this seems more idiomatic in the meantime.
* CI: make some jobs manual (Bryan Newbold, 2019-11-15; 1 file changed, -0/+2)
  Scalding test is broken :( But we aren't even using that code much these
  days.
* bump kafka max poll interval for consumers (Bryan Newbold, 2019-11-14; 1 file changed, -2/+2)
  The ingest worker keeps timing out at just over 5 minutes, so bump it
  just a bit.
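For context, this setting lives in the consumer configuration. The fragment below is illustrative: the broker address, group id, and exact values are assumptions, not what this commit set; `max.poll.interval.ms` is the librdkafka/confluent-kafka property governing how long a consumer may go between polls before the broker evicts it from the group (default 300000 ms, i.e. 5 minutes, which matches the "just over 5 minutes" timeouts described):

```python
# Illustrative confluent-kafka (librdkafka) consumer settings; values here
# are examples, not the ones from this commit.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "ingest-workers",           # hypothetical consumer group
    # allow slow workers more than the 300000 ms (5 min) default between
    # polls before the broker kicks them out of the consumer group:
    "max.poll.interval.ms": 360000,
}
```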
* update ingest-file batch size to 1 (Bryan Newbold, 2019-11-14; 1 file changed, -3/+3)
  Was defaulting to 100, which I think was resulting in lots of consumer
  group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably
  switch back to batches of 10 or so, but with multi-processing or some
  other concurrent dispatch/processing.
* refactor consume_topic name out of make_kafka_consumer() (Bryan Newbold, 2019-11-13; 1 file changed, -5/+5)
  Best to do this in wrapping code for full flexibility.
* workers: better generic batch-size arg handling (Bryan Newbold, 2019-10-03; 1 file changed, -0/+6)
* more counts and bugfixes in grobid_tool (Bryan Newbold, 2019-09-26; 1 file changed, -0/+6)
* off-by-one error in batch sizes (Bryan Newbold, 2019-09-26; 1 file changed, -1/+1)
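Batch accumulation in a pusher loop is a classic spot for an off-by-one. This sketch (hypothetical function, not the repo's code) shows the shape of the bug: flushing on `len(batch) > batch_size` yields batches of batch_size + 1 items, where `>=` is correct:

```python
def batch_records(records, batch_size):
    """Sketch of batch accumulation in a pusher loop. Using '>' instead of
    '>=' in the flush check is the classic off-by-one: it would emit
    batches one item larger than requested."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:  # '>' here would be the off-by-one bug
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```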
* lots of grobid tool implementation (still WIP) (Bryan Newbold, 2019-09-26; 1 file changed, -0/+419)