path: root/kafka/grobid_kafka_notes.txt
author     Bryan Newbold <bnewbold@archive.org>  2018-11-20 14:19:40 -0800
committer  Bryan Newbold <bnewbold@archive.org>  2018-11-20 14:19:40 -0800
commit     7186eb098b1e3f62288febe27db73685dacf1a2f (patch)
tree       12f2a6101026c5520dce62cbe4242a0ae7f3cb04 /kafka/grobid_kafka_notes.txt
parent     4cb7c1bdc6710a11c869f3d398ed39762644395c (diff)
download   sandcrawler-7186eb098b1e3f62288febe27db73685dacf1a2f.tar.gz
           sandcrawler-7186eb098b1e3f62288febe27db73685dacf1a2f.zip
kafka notes
Diffstat (limited to 'kafka/grobid_kafka_notes.txt')
-rw-r--r--  kafka/grobid_kafka_notes.txt  24
1 file changed, 24 insertions, 0 deletions
diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
new file mode 100644
index 0000000..f774291
--- /dev/null
+++ b/kafka/grobid_kafka_notes.txt
@@ -0,0 +1,24 @@
+
+Will want to be able to scale to 100-200+ fully-utilized cores running GROBID;
+how best to achieve this? We will need *many* workers doing concurrent HTTP
+GETs, POSTs, and Kafka publishes.
+
+I'm pretty confident we can relax strict "at least once"/"at most once"
+delivery constraints in this case: infrequent re-processing and missing a tiny
+fraction of works should be acceptable, because we will have higher-level
+checks (e.g., the 'ungrobided' HBase filter/dump).
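+
+As a minimal sketch of the consumer config this implies (assuming the
+confluent-kafka client; broker address and group name are illustrative),
+background auto-commit of offsets should be enough, accepting that a crash may
+re-process or drop a handful of messages:
+
+    from confluent_kafka import Consumer
+
+    # Auto-commit relaxes delivery guarantees: offsets are committed in the
+    # background, so a crash can re-process or skip a few messages, which is
+    # fine given the higher-level 'ungrobided' checks.
+    consumer = Consumer({
+        'bootstrap.servers': 'localhost:9092',  # illustrative
+        'group.id': 'grobid-workers',           # illustrative
+        'enable.auto.commit': True,
+        'auto.commit.interval.ms': 5000,
+        'auto.offset.reset': 'earliest',
+    })
+    consumer.subscribe(['ungrobided'])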
+
+For the 'ungrobided' topic, use a reasonably large number of partitions, say
+50. This sets the max number of worker *processes*, and may be enough for an
+initial single-host deployment. We can have a Python wrapper spawn many child
+processes using the multiprocessing library, with a completely independent
+Kafka client connection in each.
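+
+As a rough sketch of that wrapper (assuming the confluent-kafka client; the
+broker address and process_one() are hypothetical placeholders):
+
+    from multiprocessing import Process
+    from confluent_kafka import Consumer
+
+    def run_worker(worker_id):
+        # Each child process opens its own, completely independent Kafka
+        # connection; the consumer group balances partitions across processes.
+        consumer = Consumer({
+            'bootstrap.servers': 'localhost:9092',  # illustrative
+            'group.id': 'grobid-workers',
+            'enable.auto.commit': True,
+        })
+        consumer.subscribe(['ungrobided'])
+        while True:
+            msg = consumer.poll(1.0)
+            if msg is None or msg.error():
+                continue
+            # hypothetical: HTTP GET of PDF, POST to GROBID, publish result
+            process_one(msg.value())
+
+    if __name__ == '__main__':
+        # One process per partition (50, matching the partition count above)
+        procs = [Process(target=run_worker, args=(i,)) for i in range(50)]
+        for p in procs:
+            p.start()
+        for p in procs:
+            p.join()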
+
+To get more concurrency, each consumer *process* creates a thread pool (or
+process pool?) and a fixed-size Queue. It consumes messages, pushes them onto
+the Queue, and worker threads pull them off and do the rest. Golang sure would
+be nice for this...
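+
+Inside each consumer process, something like this (a sketch; the thread count,
+queue size, and handle() function are guesses/placeholders):
+
+    import queue
+    import threading
+
+    # Bounded queue: put() blocks when full, throttling Kafka consumption to
+    # match how fast the worker threads actually process messages.
+    task_queue = queue.Queue(maxsize=100)
+
+    def worker_loop():
+        while True:
+            raw = task_queue.get()
+            try:
+                handle(raw)  # hypothetical: fetch PDF, POST to GROBID, publish
+            finally:
+                task_queue.task_done()
+
+    threads = [threading.Thread(target=worker_loop, daemon=True)
+               for _ in range(8)]
+    for t in threads:
+        t.start()
+
+    # Main loop: poll the Kafka consumer and push message values onto the
+    # queue, e.g. task_queue.put(msg.value())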
+
+Need to ensure we have compression enabled, for the GROBID output in
+particular! Probably worth using "expensive" GZIP compression to get extra disk
+savings; latency shouldn't be a big deal here.
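+
+With the confluent-kafka producer (assumed client; config keys follow
+librdkafka), this is just a config entry, e.g.:
+
+    from confluent_kafka import Producer
+
+    # gzip is the more CPU-"expensive" codec but compresses the large GROBID
+    # output well; acceptable since latency isn't a big concern here.
+    producer = Producer({
+        'bootstrap.servers': 'localhost:9092',  # illustrative
+        'compression.codec': 'gzip',
+    })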