Will want to be able to scale to 100-200+ fully-utilized cores running GROBID;
how best to achieve this? Will need *many* workers doing concurrent HTTP GETs,
POSTs, and Kafka publishes.

I'm pretty confident we can relax "at least once"/"at most once" constraints in
this case: infrequent re-processing and missing a tiny fraction of processed
works should be acceptable, because we will have higher-level checks (e.g., the
'ungrobided' HBase filter/dump).

For the 'ungrobided' topic, use a reasonably large number of partitions, say
50. This sets the max number of worker *processes*, and may be enough for the
initial single-host worker. We can have a Python wrapper spawn many child
processes using the multiprocessing library, with a completely independent
Kafka client connection in each.
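
A minimal sketch of that wrapper, assuming confluent-kafka as the client
library; the process count, group name, and the process_message() stub are
illustrative, not the real worker code:

    import multiprocessing

    from confluent_kafka import Consumer

    def process_message(raw):
        # hypothetical stub: fetch the PDF, POST it to GROBID, publish TEI-XML
        pass

    def run_worker():
        # each child process builds its own, completely independent Kafka
        # connection; clients are never shared across processes
        consumer = Consumer({
            'bootstrap.servers': 'localhost:9092',
            'group.id': 'grobid-workers',
        })
        consumer.subscribe(['sandcrawler-qa.ungrobided'])
        while True:
            msg = consumer.poll(timeout=1.0)
            if msg is None or msg.error():
                continue
            process_message(msg.value())

    if __name__ == '__main__':
        procs = [multiprocessing.Process(target=run_worker) for _ in range(50)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()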

To get more concurrency, each consumer *process* creates a thread pool (or
process pool?) and a Queue with fixed size. It consumes messages, pushes them
onto the Queue, and worker threads pull from it and do the rest. golang sure
would be nice for this...
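
One possible shape for that, as a sketch: a fixed-size Queue between the Kafka
poll loop and a thread pool, which also gives natural backpressure. Pool size,
queue depth, and the do_grobid() callback are made-up parameters:

    import queue
    from concurrent.futures import ThreadPoolExecutor

    def consume_loop(consumer, do_grobid, num_threads=8, max_pending=100):
        # bounded queue: if GROBID calls fall behind, put() blocks and we stop
        # pulling new messages off Kafka until the threads catch up
        work = queue.Queue(maxsize=max_pending)

        def worker():
            while True:
                raw = work.get()
                try:
                    do_grobid(raw)  # HTTP GET of PDF, POST to GROBID, publish output
                finally:
                    work.task_done()

        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            for _ in range(num_threads):
                pool.submit(worker)
            while True:
                msg = consumer.poll(timeout=1.0)
                if msg is None or msg.error():
                    continue
                work.put(msg.value())  # blocks when the queue is full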

Need to ensure we have compression enabled, for the GROBID output in
particular! Probably worth using "expensive" GZIP compression to get extra disk
savings; latency shouldn't be a big deal here.
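
For example (a sketch, using librdkafka property names via confluent-kafka; the
real worker's client library may differ), the output producer could set the
codec directly; broker-side, the topic-level compression.type=gzip config is
another way to force it:

    from confluent_kafka import Producer

    producer = Producer({
        'bootstrap.servers': 'localhost:9092',
        'compression.codec': 'gzip',  # some CPU cost for much smaller TEI-XML messages
    })

kafkacat also has a -z gzip flag, which should do the same for the bulk-load
commands below.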

## Commands

Load up some example lines, without partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | kafkacat -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Load up some example lines, with partition key:

    head -n10 python/tests/files/example_ungrobided.tsv | awk -F'\t' '{print $1 "\t" $0}' | kafkacat -K$'\t' -P -b localhost:9092 -t sandcrawler-qa.ungrobided

Check ungrobided topic:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.ungrobided

Check grobid output:

    kafkacat -C -b localhost:9092 -t sandcrawler-qa.grobid-output

## Actual Production Commands

    gohdfs get sandcrawler/output-prod/2018-11-30-2125.55-dumpungrobided/part-00000
    mv part-00000 2018-11-30-2125.55-dumpungrobided.tsv
    cat 2018-11-30-2125.55-dumpungrobided.tsv | kafkacat -P -b 127.0.0.1:9092 -t sandcrawler-prod.ungrobided

## Performance

On 2018-11-21, using grobid-vm (svc096) with 30 cores, and running with 50x
kafka-grobid-worker processes (using systemd parallelization), achieved:

- 2044 PDFs extracted in 197 seconds, or about 10/second
- that's about 28 hours to process 1 million PDFs

I think this is about all a single machine can handle. To get more throughput
across multiple machines (the partition count caps the total number of worker
processes), we might need to tweak the worker to use a thread pool or some
other concurrent pattern (async?).