Will want to be able to scale to 100-200+ fully-utilized cores running GROBID; how best to achieve this? Will need *many* workers doing concurrent HTTP GETs, POSTs, and Kafka publishes.

I'm pretty confident we can relax "at least once"/"at most once" constraints in this case: infrequent re-processing and missing a tiny fraction of processed works should be acceptable, because we will have higher-level checks (e.g., the 'ungrobided' HBase filter/dump).

For the 'ungrobided' topic, use a reasonably large number of partitions, say 50. This sets the max number of worker *processes*, and may be enough for an initial single-host worker.

We can have a python wrapper spawn many child processes using the multiprocessing library, with completely independent kafka client connections in each. To get more concurrency, each consumer *process* creates a thread pool (or process pool?) and a Queue with fixed size. The process consumes messages and pushes them onto the Queue; worker threads pull them off and do the rest. golang sure would be nice for this...

Need to ensure we have compression enabled, for the GROBID output in particular! Probably worth using "expensive" GZIP compression to get extra disk savings; latency shouldn't be a big deal here.

## Commands

Load up some example lines, without partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Load up some example lines, with partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | awk -F'\t' '{print $1 "\t" $0}' | kafkacat -K$'\t' -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Check ungrobided topic:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.ungrobided

Check grobid output:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobid-output

## Actual Production Commands

    gohdfs get sandcrawler/output-prod/2018-11-30-2125.55-dumpungrobided/part-00000
    mv part-00000 2018-11-30-2125.55-dumpungrobided.tsv
    cat 2018-11-30-2125.55-dumpungrobided.tsv | kafkacat -P -b 127.0.0.1:9092 -t sandcrawler-prod.ungrobided

## Performance

On 2018-11-21, using grobid-vm (svc096) with 30 cores, and running with 50x kafka-grobid-worker processes (using systemd parallelization), achieved:

- 2044 PDFs extracted in 197 seconds, or about 10/second
- that's about 28 hours to process 1 million PDFs

I think this is about all the single machine can handle. To get more throughput with multiple machines, might need to tweak the worker to use a thread pool or some other concurrent pattern (async?).
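
A minimal sketch of the multiprocessing-wrapper-plus-thread-pool pattern described above (not the actual kafka-grobid-worker code): each child process gets completely independent Kafka connections, a bounded Queue, and a small thread pool, with gzip compression enabled on the producer. The confluent-kafka client, the topic/group names, the sizing constants, and the `process_one()` fetch/POST helper are all assumptions here.

```python
# Sketch only: a wrapper spawns N consumer processes; each process runs a
# bounded Queue plus a small thread pool. Library choice (confluent-kafka),
# names, and process_one() are placeholders, not the real worker.

import multiprocessing
import queue
import threading

from confluent_kafka import Consumer, Producer  # assumed client library

KAFKA_HOSTS = "localhost:9092"
CONSUME_TOPIC = "sandcrawler-qa.ungrobided"
PRODUCE_TOPIC = "sandcrawler-qa.grobid-output"
NUM_PROCESSES = 8        # bounded above by the topic's partition count
THREADS_PER_PROCESS = 8
QUEUE_SIZE = 20          # fixed size, so a slow GROBID back-pressures the consumer


def process_one(raw_line):
    # Placeholder for the real work: parse the TSV line, HTTP GET the PDF,
    # POST it to GROBID (/api/processFulltextDocument), return (key, output).
    raise NotImplementedError


def thread_loop(work_queue, producer):
    while True:
        raw_line = work_queue.get()
        if raw_line is None:  # shutdown sentinel
            break
        try:
            key, output = process_one(raw_line)
            producer.produce(PRODUCE_TOPIC, value=output, key=key)
            producer.poll(0)  # serve delivery callbacks
        except Exception as e:
            print("failed: {}".format(e))


def consumer_process(worker_id):
    # Completely independent kafka client connections in each child process
    consumer = Consumer({
        "bootstrap.servers": KAFKA_HOSTS,
        "group.id": "grobid-workers",
        "client.id": "grobid-worker-{}".format(worker_id),
        "enable.auto.commit": True,  # relaxed delivery semantics are fine here
    })
    consumer.subscribe([CONSUME_TOPIC])
    producer = Producer({
        "bootstrap.servers": KAFKA_HOSTS,
        "compression.type": "gzip",  # "expensive" compression for GROBID output
    })

    work_queue = queue.Queue(maxsize=QUEUE_SIZE)
    threads = [threading.Thread(target=thread_loop, args=(work_queue, producer))
               for _ in range(THREADS_PER_PROCESS)]
    for t in threads:
        t.start()

    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            work_queue.put(msg.value())  # blocks when the queue is full
    finally:
        for _ in threads:
            work_queue.put(None)
        for t in threads:
            t.join()
        producer.flush()
        consumer.close()


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=consumer_process, args=(i,))
             for i in range(NUM_PROCESSES)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The confluent-kafka producer is thread-safe, so sharing one per process keeps connection counts down; offsets auto-commit around `poll()`, so a crash can skip or re-process a handful of messages, which matches the relaxed delivery semantics noted above.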
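
For reference, creating the QA 'ungrobided' topic with 50 partitions (per the note above) could look roughly like this, assuming the stock kafka-topics.sh tool and a local zookeeper; replication factor is a placeholder:

    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 50 --topic sandcrawler-qa.ungrobided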