author     Bryan Newbold <bnewbold@archive.org>   2018-11-21 22:07:26 -0800
committer  Bryan Newbold <bnewbold@archive.org>   2018-11-21 22:07:26 -0800
commit     16f567d88cca7e79c36e4c06205861c7fe70bfa7 (patch)
tree       97b72f350adc8b7e74071cf7f6c3c902779a1cb6 /kafka
parent     73a9d1aa622865994d50bc8db097e339cbc29fe9 (diff)
download   sandcrawler-16f567d88cca7e79c36e4c06205861c7fe70bfa7.tar.gz
           sandcrawler-16f567d88cca7e79c36e4c06205861c7fe70bfa7.zip
more kafka/grobid notes
Diffstat (limited to 'kafka')
-rw-r--r--   kafka/grobid_kafka_notes.txt | 12
1 file changed, 12 insertions, 0 deletions
diff --git a/kafka/grobid_kafka_notes.txt b/kafka/grobid_kafka_notes.txt
index 0e565aa..d8bb171 100644
--- a/kafka/grobid_kafka_notes.txt
+++ b/kafka/grobid_kafka_notes.txt
@@ -40,3 +40,15 @@ Check ungrobided topic:
 Check grobid output:
 
     kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobid-output
+
+## Performance
+
+On 2018-11-21, using grobid-vm (svc096) with 30 cores, and running with 50x
+kafka-grobid-worker processes (using systemd parallelization), achieved:
+
+- 2044 PDFs extracted in 197 seconds, or about 10/second
+- that's about 28 hours to process 1 million PDFs
+
+I think this is about all a single machine can handle. To get more throughput
+beyond multiple machines, might need to tweak the worker to use a thread-pool
+or some other concurrent pattern (async?).
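The notes above mention running 50 worker processes "using systemd
parallelization". A minimal sketch of how that is commonly done with a systemd
template unit; the unit name and ExecStart path here are hypothetical, not
taken from the repo:

    # kafka-grobid-worker@.service -- hypothetical template unit; the %i
    # instance suffix lets one unit file drive many identical workers
    [Unit]
    Description=kafka-grobid-worker instance %i

    [Service]
    ExecStart=/usr/local/bin/kafka-grobid-worker
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

Fifty instances can then be started with shell brace expansion:

    systemctl start kafka-grobid-worker@{1..50}.service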
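For the thread-pool idea in the closing paragraph, a minimal sketch of what a
thread-pooled worker could look like, assuming the kafka-python client;
extract_pdf() and the input topic name are hypothetical stand-ins (only the
broker address and output topic appear in the notes above):

    # Sketch only: one consumer feeds a pool of threads, so a single process
    # overlaps many in-flight GROBID calls instead of needing 50x processes.
    from concurrent.futures import ThreadPoolExecutor

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        'sandcrawler-qa.ungrobided',        # assumed input topic name
        bootstrap_servers='localhost:9092',
        group_id='grobid-workers',
    )
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    def extract_pdf(raw_pdf):
        # hypothetical stand-in for the real GROBID extraction request
        return b'<TEI>...</TEI>'

    def handle(msg):
        # runs on a pool thread; extraction is I/O-bound from the worker's
        # side, so threads are enough to keep the GROBID server busy
        producer.send('sandcrawler-qa.grobid-output', extract_pdf(msg.value))

    with ThreadPoolExecutor(max_workers=50) as pool:
        for msg in consumer:
            pool.submit(handle, msg)

One caveat with this pattern: auto-committed consumer offsets can run ahead of
messages still being processed on pool threads, so a production version would
want manual offset commits after each message completes.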