| | Commit message | Author | Age | Files | Lines |
---|---|---|---|---|---|
* | small lint/typo/fmt fixes | Bryan Newbold | 2022-02-24 | 1 | -1/+1 |
| | |||||
* | codespell typos in python (comments) | Bryan Newbold | 2021-11-24 | 1 | -1/+1 |
| | |||||
* | workers: use HTTP session for archive.org fetches | Bryan Newbold | 2021-11-03 | 1 | -3/+3 |
| | |||||
* | make fmt (black 21.9b0) | Bryan Newbold | 2021-10-27 | 1 | -128/+161 |
| | |||||
* | fix type annotations for petabox body fetch helper | Bryan Newbold | 2021-10-26 | 1 | -4/+4 |
| | |||||
* | more progress on type annotations | Bryan Newbold | 2021-10-26 | 1 | -1/+1 |
| | |||||
* | type annotations on SandcrawlerWorker | Bryan Newbold | 2021-10-26 | 1 | -46/+57 |
| | These annotations have a broad impact! Being conservative to start: Any-to-Any for process(), etc. | | | | |
* | flake8 clean (with current settings) | Bryan Newbold | 2021-10-26 | 1 | -6/+10 |
| | |||||
* | start handling trivial lint cleanups: unused imports, 'is None', etc | Bryan Newbold | 2021-10-26 | 1 | -1/+0 |
| | |||||
* | make fmt | Bryan Newbold | 2021-10-26 | 1 | -35/+25 |
| | |||||
* | python: isort all imports | Bryan Newbold | 2021-10-26 | 1 | -6/+7 |
| | |||||
* | Revert "reimplement worker timeout with multiprocessing" | Bryan Newbold | 2020-10-22 | 1 | -17/+23 |
| | This reverts commit 031f51752e79dbdde47bbc95fe6b3600c9ec711a. Didn't actually work when testing; can't pickle the Kafka Producer object (and probably other objects). | | | | |
* | reimplement worker timeout with multiprocessing | Bryan Newbold | 2020-10-22 | 1 | -23/+17 |
| | |||||
* | differentiate wayback-error from wayback-content-error | Bryan Newbold | 2020-10-21 | 1 | -3/+3 |
| | The motivation here is to distinguish errors due to current content in wayback (eg, in WARCs) from operational errors (eg, wayback machine is down, or network failures/disruption). | | | | |
* | customize timeout per worker; 120sec for pdf-extract | Bryan Newbold | 2020-06-29 | 1 | -1/+2 |
| | This is a stab-in-the-dark attempt to resolve long timeouts with this worker in prod. | | | | |
* | handle empty fetched blob | Bryan Newbold | 2020-06-27 | 1 | -1/+6 |
| | |||||
* | CDX KeyError as WaybackError from fetch worker | Bryan Newbold | 2020-06-26 | 1 | -1/+1 |
| | |||||
* | don't nest generic fetch errors under pdf_trio | Bryan Newbold | 2020-06-25 | 1 | -12/+6 |
| | This came from sloppy refactoring (and missing test coverage). | | | | |
* | fixes and tweaks from testing locally | Bryan Newbold | 2020-06-17 | 1 | -2/+2 |
| | |||||
* | workers: refactor to pass key to process() | Bryan Newbold | 2020-06-17 | 1 | -7/+15 |
| | |||||
* | refactor worker fetch code into wrapper class | Bryan Newbold | 2020-06-16 | 1 | -1/+88 |
| | |||||
* | rename KafkaGrobidSink -> KafkaCompressSink | Bryan Newbold | 2020-06-16 | 1 | -1/+1 |
| | |||||
* | workers: add missing want() dataflow path | Bryan Newbold | 2020-04-30 | 1 | -0/+9 |
| | |||||
* | timeouts: don't push through None error messages | Bryan Newbold | 2020-04-29 | 1 | -2/+2 |
| | |||||
* | worker timeout wrapper, and use for kafka | Bryan Newbold | 2020-04-27 | 1 | -2/+40 |
| | |||||
* | batch/multiprocess for ZipfilePusher | Bryan Newbold | 2020-04-16 | 1 | -3/+18 |
| | |||||
* | workers: add explicit process to base class | Martin Czygan | 2020-03-12 | 1 | -0/+6 |
| | As per https://docs.python.org/3/library/exceptions.html#NotImplementedError: "In user defined base classes, abstract methods should raise this exception when they require derived classes to override the method [...]." | | | | |
* | improvements to reliability from prod testing | Bryan Newbold | 2020-02-03 | 1 | -2/+9 |
| | |||||
* | hack-y backoff ingest attempt | Bryan Newbold | 2020-02-03 | 1 | -1/+15 |
| | The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism. This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run, while the KafkaPusher thread does polling. Maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level to do the polling (Kafka) and "await" (for all worker threads to complete) correctly. | | | | |
* | worker kafka setting tweaks | Bryan Newbold | 2020-01-28 | 1 | -2/+4 |
| | These are all attempts to get kafka workers operating more smoothly. | | | | |
* | workers: yes, poll is necessary | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | fix kafka worker partition-specific error | Bryan Newbold | 2020-01-28 | 1 | -1/+1 |
| | |||||
* | have JsonLinePusher continue on JSON decode errors (but count) | Bryan Newbold | 2020-01-02 | 1 | -1/+5 |
| | |||||
* | refactor: use print(..., file=sys.stderr) | Bryan Newbold | 2019-12-18 | 1 | -20/+22 |
| | Should use logging soon, but this seems more idiomatic in the meanwhile. | | | | |
* | CI: make some jobs manual | Bryan Newbold | 2019-11-15 | 1 | -0/+2 |
| | Scalding test is broken :( But we aren't even using that code much these days. | | | | |
* | bump kafka max poll interval for consumers | Bryan Newbold | 2019-11-14 | 1 | -2/+2 |
| | The ingest worker keeps timing out at just over 5 minutes, so bump it just a bit. | | | | |
* | update ingest-file batch size to 1 | Bryan Newbold | 2019-11-14 | 1 | -3/+3 |
| | Was defaulting to 100, which I think was resulting in lots of consumer group timeouts, resulting in UNKNOWN_MEMBER_ID errors. Will probably switch back to batches of 10 or so, but with multi-processing or some other concurrent dispatch/processing. | | | | |
* | refactor consume_topic name out of make_kafka_consumer() | Bryan Newbold | 2019-11-13 | 1 | -5/+5 |
| | Best to do this in wrapping code for full flexibility. | | | | |
* | workers: better generic batch-size arg handling | Bryan Newbold | 2019-10-03 | 1 | -0/+6 |
| | |||||
* | more counts and bugfixes in grobid_tool | Bryan Newbold | 2019-09-26 | 1 | -0/+6 |
| | |||||
* | off-by-one error in batch sizes | Bryan Newbold | 2019-09-26 | 1 | -1/+1 |
| | |||||
* | lots of grobid tool implementation (still WIP) | Bryan Newbold | 2019-09-26 | 1 | -0/+419 |
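
The "workers: add explicit process to base class" commit in the log above cites the Python documentation's advice that abstract methods in a base class should raise `NotImplementedError`. A minimal sketch of that pattern follows; `BaseWorker` and `UppercaseWorker` are hypothetical names used only for illustration and are not taken from the repository.

```python
from typing import Any, Optional


class BaseWorker:
    """Hypothetical stand-in for a worker base class."""

    def process(self, task: Any, key: Optional[str] = None) -> Any:
        # Abstract method: derived classes are required to override this.
        raise NotImplementedError("process() must be implemented by a subclass")


class UppercaseWorker(BaseWorker):
    """Hypothetical derived worker that overrides process()."""

    def process(self, task: Any, key: Optional[str] = None) -> Any:
        return str(task).upper()
```

Calling `UppercaseWorker().process("abc")` returns the uppercased string, while calling `process()` on a bare `BaseWorker` raises `NotImplementedError`, which makes a missing override fail loudly rather than silently returning `None`.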
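
The revert of "reimplement worker timeout with multiprocessing" in the log above notes that the Kafka Producer object could not be pickled. The sketch below illustrates that general failure mode using only the standard library, under the assumption that the producer holds unpicklable internal state; `FakeProducer` (which holds a `threading.Lock`) and `Worker` are hypothetical stand-ins, not the real Kafka client or repository classes.

```python
import pickle
import threading
from typing import Any


class FakeProducer:
    """Hypothetical stand-in for a client object with unpicklable state."""

    def __init__(self) -> None:
        # Locks (like sockets and native handles) cannot be pickled.
        self._lock = threading.Lock()


class Worker:
    """Hypothetical worker that keeps a reference to its output sink."""

    def __init__(self, sink: Any) -> None:
        self.sink = sink


def is_picklable(obj: Any) -> bool:
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True
```

Here `is_picklable(Worker(FakeProducer()))` returns `False`, because pickling the worker also tries to pickle the lock it holds indirectly. Handing such an object to a `multiprocessing` child process can fail for the same reason whenever the object has to be serialized to reach the child.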
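
The "have JsonLinePusher continue on JSON decode errors (but count)" commit in the log above describes skipping malformed lines while keeping a tally. A minimal sketch of that idea follows; the function name `parse_json_lines` and the counter keys are assumptions made for illustration and do not reflect the repository's actual code.

```python
import json
from collections import Counter
from typing import Any, Iterable, List, Tuple


def parse_json_lines(lines: Iterable[str]) -> Tuple[List[Any], Counter]:
    """Parse JSON lines, counting (rather than raising on) decode errors."""
    counts: Counter = Counter()
    records: List[Any] = []
    for line in lines:
        if not line.strip():
            # Blank lines are skipped and not counted.
            continue
        counts["total"] += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Keep going, but remember that this line was bad.
            counts["error-json-decode"] += 1
            continue
        records.append(record)
        counts["pushed"] += 1
    return records, counts
```

Counting errors instead of aborting lets a long batch run finish, while the final counts still show how many input lines were dropped.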