author | Bryan Newbold <bnewbold@archive.org> | 2018-11-20 15:09:43 -0800 |
---|---|---|
committer | Bryan Newbold <bnewbold@archive.org> | 2018-11-20 15:09:43 -0800 |
commit | c12148851e26c14b38ec6cadbe2322829fde23e6 (patch) | |
tree | 3918a83320a8f8d26b4ad4b6701391cd2b58035c /kafka | |
parent | 7186eb098b1e3f62288febe27db73685dacf1a2f (diff) | |
initial work on kafka_grobid worker
Diffstat (limited to 'kafka')
-rw-r--r-- | kafka/grobid_kafka_notes.txt | 18 |
1 files changed, 18 insertions, 0 deletions
diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
index f774291..26c450f 100644
--- a/kafka/grobid_kafka_notes.txt
+++ b/kafka/grobid_kafka_notes.txt
@@ -22,3 +22,21 @@ this...
 Need to ensure we have compression enabled, for the GROBID output in
 particular! Probably worth using "expensive" GZIP compression to get extra disk
 savings; latency shouldn't be a big deal here.
+
+## Commands
+
+Load up some example lines, without partition key:
+
+    head -n10 python/tests/files/example_ungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Load up some example lines, with partition key:
+
+    head -n10 python/tests/files/example_ungrobided.tsv | awk -F'\t' '{print $1 "\t" $0}' | kafkacat -K$'\t' -P -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Check ungrobided topic:
+
+    kafkacat -C -b localhost:9092 -t sandcrawler-qa.ungrobided
+
+Check grobid output:
+
+    kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobided
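
Reviewer sketch (not part of the commit): the "with partition key" command relies on awk prepending the first TSV column to each line, so that `kafkacat -K` with a tab delimiter can split it off as the message key and send the untouched original line as the value. The snippet below reproduces only that awk step, without a broker. The function names and sample rows are hypothetical stand-ins, not contents of `example_ungrobided.tsv`.

```shell
#!/bin/sh
# Sketch of the key-prefixing step from the "with partition key" command.
# Assumption: column 1 of the TSV is the value wanted as the Kafka message key.

# Prepend column 1 plus a tab to every line; the original line stays intact after it.
add_partition_key() {
    awk -F'\t' '{print $1 "\t" $0}'
}

# Hypothetical sample rows (made-up values, used only for illustration).
sample_rows() {
    printf 'key-one\tcol2\tcol3\n'
    printf 'key-two\tcol2\tcol3\n'
}

# Show the part before the first tab, which is what a tab key delimiter
# would separate out as the key.
sample_rows | add_partition_key | cut -f1
```

Piping the output of `add_partition_key` into the `kafkacat -K$'\t' -P ...` invocation from the notes would then produce keyed messages. Note that `$'\t'` is a bash-style quoting form, so that command assumes a shell that supports it.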