author: Bryan Newbold <bnewbold@archive.org> 2018-11-20 15:09:43 -0800
committer: Bryan Newbold <bnewbold@archive.org> 2018-11-20 15:09:43 -0800
commit: c12148851e26c14b38ec6cadbe2322829fde23e6
tree: 3918a83320a8f8d26b4ad4b6701391cd2b58035c /kafka
parent: 7186eb098b1e3f62288febe27db73685dacf1a2f
initial work on kafka_grobid worker
Diffstat (limited to 'kafka')
-rw-r--r-- kafka/grobid_kafka_notes.txt | 18
1 files changed, 18 insertions, 0 deletions
diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
index f774291..26c450f 100644
--- a/kafka/grobid_kafka_notes.txt
+++ b/kafka/grobid_kafka_notes.txt
@@ -22,3 +22,21 @@ this...
Need to ensure we have compression enabled, for the GROBID output in
particular! Probably worth using "expensive" GZIP compression to get extra disk
savings; latency shouldn't be a big deal here.
+
+## Commands
+
+Load up some example lines, without partition key:
+
+ head -n10 python/tests/files/example_ungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Load up some example lines, with partition key:
+
+ head -n10 python/tests/files/example_ungrobided.tsv | awk -F'\t' '{print $1 "\t" $0}' | kafkacat -K$'\t' -P -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Check ungrobided topic:
+
+ kafkacat -C -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Check grobid output:
+
+ kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobided
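
Note on the "with partition key" command in the patch above: the awk step copies the first TSV column to the front of each line, and `kafkacat -K$'\t'` then splits at the first tab, so that column becomes the message key while the full original line stays as the message value. The following is a minimal sketch for checking that transformation locally without a broker. The function name `add_partition_key` and the sample rows are hypothetical and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): prefix each TSV line with a copy of
# its first column, mirroring the awk step from the notes.
add_partition_key() {
    awk -F'\t' '{print $1 "\t" $0}'
}

# Dry run on two made-up rows; nothing is sent to Kafka here.
printf 'key1\tfoo\tbar\nkey2\tbaz\tqux\n' | add_partition_key
```

Piping the helper's output into `kafkacat -K$'\t' -P ...` as in the notes should then produce keyed messages; consuming with the same `-K$'\t'` option prints each key alongside its value, which is one way to confirm the keys were set.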