update notes

author: Bryan Newbold <bnewbold@archive.org> 2018-12-10 13:33:41 +0800
committer: Bryan Newbold <bnewbold@archive.org> 2018-12-10 13:33:41 +0800
commit: 6e8305e625f8b033d2697d40ed31ec15368678f9 (patch)
tree: cec31f542750e922786a1e3bf8a6eb60529ab06e /kafka
parent: 4736db1b1caca50a83bf7fb0d45e2e8d48d4e233 (diff)
download: sandcrawler-6e8305e625f8b033d2697d40ed31ec15368678f9.tar.gz
sandcrawler-6e8305e625f8b033d2697d40ed31ec15368678f9.zip
2 files changed, 45 insertions, 0 deletions
diff --git a/kafka/debugging_issues.txt b/kafka/debugging_issues.txt
new file mode 100644
index 0000000..1af490e
--- /dev/null
+++ b/kafka/debugging_issues.txt
@@ -0,0 +1,39 @@
+
+## 2018-12-02
+
+Had been having some troubles with consumer group partition assignments with
+the grobid-output group and grobid-hbase-insert consumer group. Tried deleting
+and re-creating, which was probably a mistake. Also tried to use kafka-broker
+shell scripts to clean up/debug, which didn't work well.
+
+In the end, after re-building the topic, decided to create a new consumer group
+(grobid-hbase-insert2) to get rid of history/crap. Might need to do this again
+in the future, oh well.
+
+A few things learned:
+
+- whatever pykafka "native python" is producing to consumer group offsets
+  doesn't work great with kafka-manager or the shell scripts: consumer instance
+  names don't show. this is an error in shell scripts, and blank/red in
+  kafka-manager
+- restarting kafka-manager takes a while (for it to refresh data?) and it shows
+  inconsistent stuff during that period, but it does result in cleaned up
+  consumer group cached metadata (aka, old groups are cleared)
+- kafka-manager can't fetch JMX info, either due to lack of config or port
+  blocking. should try to fix this for metrics etc
+- it would be nice to be using recent librdkafka everywhere. pykafka can
+  optionally use this, and many other tools do automatically. however, this is
+  a system package, and xenial doesn't have backports (debian stretch does).
+  the version in bionic looks "good enough", so maybe should try that?
+- there has been a minor release of kafka (2.1) since I installed (!)
+- the burrow (consumer group monitoring) tool is packaged for some version of
+  ubuntu
+
+In general, not feeling great about the current setup. Very frustrating that the
+debug/status tools are broken with pykafka native output. Need to at least
+document things a lot better.
+
+Separately, came up with an idea to do batched processing with GROBID: don't
+auto-commit, instead consume a batch (10? or until block), process those, then
+commit. This being a way to get "the batch size returned".
+
diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
index d8bb171..b4fa2a8 100644
--- a/kafka/grobid_kafka_notes.txt
+++ b/kafka/grobid_kafka_notes.txt
@@ -41,6 +41,12 @@ Check grobid output:
 
     kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobid-output
 
+## Actual Production Commands
+
+    gohdfs get sandcrawler/output-prod/2018-11-30-2125.55-dumpungrobided/part-00000
+    mv part-00000 2018-11-30-2125.55-dumpungrobided.tsv
+    cat 2018-11-30-2125.55-dumpungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-prod.ungrobided
+
 ## Performance
 
 On 2018-11-21, using grobid-vm (svc096) with 30 cores, and running with 50x
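
The batched-processing idea from the debugging_issues.txt notes (don't
auto-commit; consume a batch of about 10 or until the consumer would block,
process it, then commit) could look roughly like the sketch below. This is not
code from the repository: process_in_batches and process_batch are hypothetical
names, and the consumer is only assumed to expose pykafka-style
consume(block=...) and commit_offsets() methods, with auto-commit already
disabled when it was created.

```python
def process_in_batches(consumer, process_batch, batch_size=10):
    """Consume up to batch_size messages, process them, then commit offsets.

    Assumption: `consumer` has auto-commit disabled and offers pykafka-style
    consume(block=...) and commit_offsets(). `process_batch` is a hypothetical
    callable that handles a list of messages (e.g. sends them to GROBID).
    Returns the total number of messages processed.
    """
    total = 0
    while True:
        batch = []
        while len(batch) < batch_size:
            # Block only for the first message of a batch; after that, take
            # whatever is immediately available ("10, or until block").
            msg = consumer.consume(block=not batch)
            if msg is None:
                break
            batch.append(msg)
        if not batch:
            # Even the blocking consume() returned nothing (for example, a
            # consumer timeout elapsed), so stop.
            return total
        process_batch(batch)
        # Commit only after the whole batch has been processed, so a crash
        # mid-batch means the batch is re-read rather than lost.
        consumer.commit_offsets()
        total += len(batch)
```

With pykafka, the consumer would presumably be created with
auto_commit_enable=False, plus a consumer_timeout_ms if the loop should ever
exit on its own. The trade-off is at-least-once delivery: messages from a
failed batch get processed again after a restart.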