author Bryan Newbold <bnewbold@archive.org> 2020-02-03 20:42:15 -0800
committer Bryan Newbold <bnewbold@archive.org> 2020-02-03 21:51:00 -0800
commit 15c7e430ebccbdab88355c5c1f1914c3aca99c8a
tree 6ce8d01b11f6e2c6792138046219c7e624aa2d0d /python/ia_pdf_match.py
parent 5f9e7fd4c89df98ed90be9629d3dc6c201b42a02
hack-y backoff ingest attempt
The goal here is to have SPNv2 requests back off when we get back-pressure (usually caused by some sessions taking too long). Lack of proper back-pressure is making it hard to turn up parallelism.

This is a hack because we still time out and drop the slow request. A better way is probably to have a background thread run the processing while the KafkaPusher thread does polling, maybe with timeouts to detect slow processing (greater than 30 seconds?) and only pause/resume in that case. This would also make taking batches easier. Unlike the existing code, however, the parallelism needs to happen at the Pusher level so that the polling (Kafka) and the "await" (for all worker threads to complete) are both done correctly.
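The "better way" described above might look roughly like the following. This is only a sketch under stated assumptions, not sandcrawler's code: `FakeConsumer`, `push_with_backoff`, `process`, `slow_after`, `poll_interval`, and `max_idle_polls` are all hypothetical names, and the fake consumer stands in for a real Kafka client (whose pause/resume calls typically take a list of partitions). A worker thread handles each message; the pusher loop only pauses the consumer, and keeps polling, when processing runs past the slow threshold.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class FakeConsumer:
    """Hypothetical stand-in for a Kafka consumer, so the sketch runs without a broker."""

    def __init__(self, messages):
        self._messages = list(messages)
        self.paused = False
        self.events = []  # records pause/resume calls for inspection

    def poll(self, timeout):
        # A paused or empty consumer hands back nothing; the call only keeps it alive.
        if self.paused or not self._messages:
            time.sleep(timeout)
            return None
        return self._messages.pop(0)

    def pause(self):
        self.paused = True
        self.events.append("pause")

    def resume(self):
        self.paused = False
        self.events.append("resume")


def push_with_backoff(consumer, process, slow_after=30.0, poll_interval=1.0,
                      max_idle_polls=3):
    """Run process(msg) in a worker thread; pause the consumer only when it is slow.

    Stops after max_idle_polls consecutive empty polls (an assumption made so
    that the sketch terminates; a real pusher would loop forever).
    """
    results = []
    idle = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        while idle < max_idle_polls:
            msg = consumer.poll(poll_interval)
            if msg is None:
                idle += 1
                continue
            idle = 0
            future = pool.submit(process, msg)
            # Fast path: the worker finishes within the threshold, no pause needed.
            done, _ = wait([future], timeout=slow_after)
            if not done:
                # Slow path: stop fetching new messages, but keep polling so the
                # consumer stays alive, and do not drop the in-flight request.
                consumer.pause()
                while not future.done():
                    consumer.poll(poll_interval)
                consumer.resume()
            results.append(future.result())
    return results
```

With a single worker this only shows the pause/resume mechanics; taking batches would mean submitting several messages before waiting, which is where doing the parallelism at the pusher level starts to matter.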
Diffstat (limited to 'python/ia_pdf_match.py')
0 files changed, 0 insertions, 0 deletions