diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
new file mode 100644
index 0000000..a1f7380
--- /dev/null
+++ b/kafka/grobid_kafka_notes.txt

Will want to be able to scale to 100-200+ fully-utilized cores running GROBID;
how best to achieve this? Will need *many* workers doing concurrent HTTP GETs,
HTTP POSTs, and Kafka publishes.

I'm pretty confident we can relax the "at least once"/"at most once" delivery
constraints in this case: infrequent re-processing and missing a tiny fraction
of processed works should both be acceptable, because we will have
higher-level checks (eg, the 'ungrobided' HBase filter/dump).

For the 'ungrobided' topic, use a reasonably large number of partitions, say
50. This caps the number of worker *processes* (one consumer per partition),
and may be enough for the initial single-host worker. We can have a python
wrapper spawn many child processes using the multiprocessing library, with a
completely independent kafka client connection in each (see the process-pool
sketch at the end of these notes).

To get more concurrency, each consumer *process* creates a thread pool (or
process pool?) and a Queue with fixed size. It consumes messages, pushes them
onto the Queue, and worker threads pull and do the rest (see the bounded-queue
sketch below). golang sure would be nice for this...

Need to ensure we have compression enabled, for the GROBID output in
particular! Probably worth using "expensive" GZIP compression for the extra
disk savings; latency shouldn't be a big deal here (see the producer config
sketch below).

## Commands

Load up some example lines, without partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Load up some example lines, with partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | awk -F'\t' '{print $1 "\t" $0}' | kafkacat -K$'\t' -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Check the ungrobided topic:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.ungrobided

Check GROBID output:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobid-output

## Actual Production Commands

    gohdfs get sandcrawler/output-prod/2018-11-30-2125.55-dumpungrobided/part-00000
    mv part-00000 2018-11-30-2125.55-dumpungrobided.tsv
    cat 2018-11-30-2125.55-dumpungrobided.tsv | kafkacat -P -b 127.0.0.1:9092 -t sandcrawler-prod.ungrobided

## Performance

On 2018-11-21, using grobid-vm (svc096) with 30 cores, and running 50x
kafka-grobid-worker processes (using systemd parallelization), achieved:

- 2044 PDFs extracted in 197 seconds, or about 10/second
- at that rate, about 28 hours to process 1 million PDFs

I think this is about all a single machine can handle. To get more throughput
beyond adding machines, we might need to tweak the worker to use a thread pool
or some other concurrent pattern (async?).
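
## Sketches (untested)

A minimal sketch of the multi-process wrapper described above, assuming the
kafka-python library; process_one() is a hypothetical stand-in for the
fetch/GROBID/publish round-trip, and the broker address is an assumption:

    import multiprocessing

    from kafka import KafkaConsumer

    KAFKA_HOSTS = "localhost:9092"  # assumption; swap in the real brokers
    NUM_PROCS = 8                   # up to the partition count (50)

    def process_one(raw_msg):
        # hypothetical: GET the PDF, POST it to GROBID, publish the result
        pass

    def run_worker():
        # each child process gets a completely independent kafka client
        # connection; the broker balances the topic's partitions across
        # all members of the consumer group
        consumer = KafkaConsumer(
            "sandcrawler-qa.ungrobided",
            bootstrap_servers=KAFKA_HOSTS,
            group_id="grobid-workers",
        )
        for msg in consumer:
            process_one(msg.value)

    if __name__ == "__main__":
        procs = [multiprocessing.Process(target=run_worker)
                 for _ in range(NUM_PROCS)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()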
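
The bounded-queue fan-out inside a single consumer process might look like
this (again kafka-python; fetch_and_grobid() is a hypothetical stand-in for
the per-message work):

    import queue
    import threading

    from kafka import KafkaConsumer

    QUEUE_SIZE = 100   # fixed size, so the consumer can't run ahead of workers
    NUM_THREADS = 10

    def fetch_and_grobid(raw_msg):
        # hypothetical per-message work: HTTP fetch, GROBID POST, publish
        pass

    def worker_loop(q):
        while True:
            msg = q.get()
            try:
                fetch_and_grobid(msg)
            finally:
                q.task_done()

    def run():
        q = queue.Queue(maxsize=QUEUE_SIZE)
        for _ in range(NUM_THREADS):
            threading.Thread(target=worker_loop, args=(q,), daemon=True).start()
        consumer = KafkaConsumer(
            "sandcrawler-qa.ungrobided",
            bootstrap_servers="localhost:9092",
            group_id="grobid-workers",
        )
        for msg in consumer:
            q.put(msg.value)  # blocks when the Queue is full

The q.put() call blocking on a full Queue is what throttles consumption to
match worker throughput.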
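
Enabling the "expensive" GZIP compression on the output producer is a single
config option in kafka-python (topic name from the commands above; the sample
payload is just a placeholder):

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        compression_type="gzip",  # more CPU per batch, better disk savings
    )
    producer.send("sandcrawler-qa.grobid-output", b"<TEI ...>")
    producer.flush()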