## Early Notes (2018-11-13)

Ran through about 100k crossref objects, resulting in about 77k messages (in
about 4k editgroups/changelogs).

Have seen tens of messages per second go through trivially.

The elastic-release worker is the current bottleneck, at only about 4.3
messages/second. Because this worker consumes from 8 partitions, I have a
feeling it might be consumer-group related; kafka-manager shows "0% coverage"
for this topic. Note that this is a single worker process.

`_consumer_offsets` is seeing about 36 messages/sec.

Oh, looks like I just needed to enable auto_commit and tune parameters in
pykafka!
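
For reference, a minimal sketch of the kind of pykafka consumer configuration
meant here (topic and consumer group names are placeholders, and the tuning
values are illustrative, not the actual production settings):

```
from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"release-updates"]  # hypothetical topic name

consumer = topic.get_balanced_consumer(
    consumer_group=b"elastic-release-worker",  # hypothetical group name
    auto_commit_enable=True,        # commit offsets automatically...
    auto_commit_interval_ms=30000,  # ...but only every 30s, to cut _consumer_offsets churn
    num_consumer_fetchers=2,        # number of fetch threads (see note below)
    queued_max_messages=1000,       # per-partition prefetch buffer
)

def process(payload):
    # placeholder for the actual transform + elasticsearch insert
    pass

for message in consumer:
    if message is not None:
        process(message.value)
```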

That helped reduce `_consumer_offsets` churn significantly, but didn't
increase throughput (or not by much). Might want to switch to kafka connect
(presuming it somehow does faster/bulk inserts/indexing), with a simple worker
doing the transforms. Probably worth doing a `> /dev/null` version of the
worker first (with a different consumer group) to make sure the bottleneck
isn't somewhere else.
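
A discard-only consumer along these lines would isolate raw consume throughput
from the transform and indexing cost. This is a rough sketch: names are
placeholders, and it uses a separate consumer group so the real worker's
offsets are untouched:

```
import time
from pykafka import KafkaClient

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"release-updates"]  # hypothetical topic name

consumer = topic.get_balanced_consumer(
    consumer_group=b"devnull-throughput-test",  # separate, throwaway group
    auto_commit_enable=True,
)

# Consume and discard, printing a running messages/second estimate.
count = 0
start = time.time()
for message in consumer:
    count += 1
    if count % 1000 == 0:
        elapsed = time.time() - start
        print("%d msgs, %.1f msgs/sec" % (count, count / elapsed))
```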

Another thing to try is more kafka fetch threads (`num_consumer_fetchers` in
pykafka).

elastic-release python processing is at 66% CPU (of one core)! and
elasticsearch at ~30%. Huh.

But, in general, "seems to be working".

## End-To-End

Approximate per-topic message rates:

release-updates: 40/sec
api-crossref: 40/sec
api-datacite: 15/sec
changelog: 11/sec
_consumer_offsets: 0.5/sec

elastic indexing looks like only 8/sec or so. Probably need to batch.
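
Something along these lines, using the elasticsearch-py bulk helper instead of
one index request per message, is probably what's needed. Index name, field
names, and batch size below are illustrative placeholders, not the actual
schema:

```
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from pykafka import KafkaClient

es = Elasticsearch()  # defaults to localhost:9200

def transform(raw):
    # placeholder: convert a raw kafka message into an elasticsearch doc
    return json.loads(raw)

def flush(batch):
    actions = [
        {
            "_index": "release",   # hypothetical index name
            "_type": "_doc",       # needed on pre-7.x elasticsearch
            "_id": doc["ident"],   # assumes docs carry an 'ident' field
            "_source": doc,
        }
        for doc in batch
    ]
    bulk(es, actions)

client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"release-updates"]      # hypothetical topic name
consumer = topic.get_balanced_consumer(
    consumer_group=b"elastic-release-worker",  # hypothetical group name
    auto_commit_enable=True,
)

# Note: with auto-commit, offsets can be committed before a batch is flushed;
# fine for a throughput sketch, not for production.
batch = []
for message in consumer:
    batch.append(transform(message.value))
    if len(batch) >= 200:  # illustrative batch size
        flush(batch)
        batch = []
```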

Tried running additional fatcat-elasticsearch-release-worker processes, and
throughput scales roughly linearly with the number of workers.

Are consumer group names not actually topic-dependent? Hrm, might need to
rename them all for the prod/qa split.