-
@dotnwat can this be some threading issue inside librdkafka? I wonder if we should write our own benchmarking tool now that we have a client. I wrote one that takes coordinated omission into account (kinda janky, but it worked: https://github.com/smfrpc/smf/blob/master/src/include/smf/load_channel.h), so we could ship one as well. 28ms on some client-side stuff seems insane, so another theory is the clock source in librdkafka: could it have some low-res clock? Just some ideas.
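For reference, a minimal sketch of the coordinated-omission idea (a hypothetical harness, not the smf code): issue requests on a fixed cadence and measure each latency from the *intended* send time, so any backlog in the sender counts against the recorded latency instead of being silently omitted.

```cpp
// Coordinated-omission-aware load loop: latency is measured from the
// scheduled send time, not the actual send time. Hypothetical harness.
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    using namespace std::chrono;
    const auto interval = milliseconds(10);  // target rate: 100 req/s
    const int total = 100;
    std::vector<double> latencies_ms;
    latencies_ms.reserve(total);

    auto next_intended = steady_clock::now();
    for (int i = 0; i < total; ++i) {
        next_intended += interval;
        std::this_thread::sleep_until(next_intended);
        // A real harness would issue a produce request here; we simulate
        // the request taking ~1ms to complete.
        std::this_thread::sleep_for(milliseconds(1));
        auto done = steady_clock::now();
        // Key point: measure from the intended send time. If the loop
        // fell behind schedule, the backlog shows up in the sample
        // instead of being omitted from the distribution.
        latencies_ms.push_back(
          duration<double, std::milli>(done - next_intended).count());
    }
    std::printf("recorded %zu samples\n", latencies_ms.size());
    return 0;
}
```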
-
A separate tool, the kafka produce performance test, reports a metric for a similar workload.
-
We finally tracked down the source of the difference between the rtt latency reported by librdkafka and the processing time for a request in redpanda. Effectively we were seeing queueing within the network stack: librdkafka will dispatch requests onto the network, and the rtt includes the time spent sitting in networking buffers. This can be controlled with the client's queue buffering limit. Here is the rtt in milliseconds for a couple of queue depth settings: 1024 is roughly one 1mb batch outstanding, and 8192 would be roughly a queue depth of 8 batches outstanding. So for example at qd=2 we get 3ms rtt.
Here are the corresponding throughput numbers. For instance at qd=2 we can push 365 mb/s.
Keep in mind that this is single node, single partition, and queue depth = 1 inside the kafka layer. This means that throughput is very sensitive to latency. This serialization in the kafka layer is also why our throughput maxes out fairly quickly. What we'll look at next is optimizations that let us keep mostly qd=1 on the kafka side while improving throughput for produce handling. This should generally give us better latencies at deeper queue depths.
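Assuming the queue buffering limit in question is librdkafka's `queue.buffering.max.kbytes` (1024 kbytes would match one 1mb batch outstanding; treat the exact knob as an assumption), here is a minimal sketch of capping the client-side queue depth:

```cpp
// Sketch: capping librdkafka's client-side buffering so roughly one
// 1mb batch is outstanding at a time (~qd=1). The exact knob is an
// assumption; verify against your librdkafka version.
#include <iostream>
#include <string>
#include <librdkafka/rdkafkacpp.h>

int main() {
    std::string errstr;
    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    conf->set("bootstrap.servers", "localhost:9092", errstr);
    // Limit buffered-but-unsent data to ~1 MB.
    if (conf->set("queue.buffering.max.kbytes", "1024", errstr)
        != RdKafka::Conf::CONF_OK) {
        std::cerr << errstr << std::endl;
        return 1;
    }
    RdKafka::Producer* producer = RdKafka::Producer::create(conf, errstr);
    if (!producer) {
        std::cerr << errstr << std::endl;
        return 1;
    }
    delete producer;
    delete conf;
    return 0;
}
```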
-
We'll be moving to queue depth 1 at the kafka layer so that we can properly enforce ordering constraints for kafka clients; throughput scales with more partitions. We have a batching solution for small batches that should still provide good throughput at queue depth 1 by splitting raft replication into a synchronous component (ordering) and an asynchronous component (waiting on flushes), where only the ordering phase is subject to the queue depth 1 constraint.
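A rough model of that split (hypothetical names, not redpanda's actual raft code): `replicate()` assigns the offset inside a serialized ordering section and returns immediately with a future that resolves when the flush completes, so only ordering sits on the qd=1 path.

```cpp
// Rough model: synchronous ordering, asynchronous flush. All names
// here are hypothetical stand-ins, not redpanda's implementation.
#include <cstdint>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

struct batch { std::vector<char> data; };

class replicator {
    std::mutex order_mu_;  // serializes the ordering phase (the qd=1 part)
    int64_t next_offset_ = 0;

public:
    // Returns the assigned offset right away; the future resolves once
    // the batch is durable. acks=all waits on the future, but the next
    // batch can enter the ordering section before this flush finishes.
    std::pair<int64_t, std::future<void>> replicate(batch b) {
        int64_t offset;
        {
            std::lock_guard<std::mutex> g(order_mu_);
            offset = next_offset_++;      // ordering decided here
            append_to_log(std::move(b));  // hypothetical in-memory append
        }
        // Flush runs off the ordering path and may cover several appends.
        return {offset, std::async(std::launch::async, [] { fsync_log(); })};
    }

private:
    void append_to_log(batch) {}
    static void fsync_log() {}
};
```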
Effectively the same behavior and performance, which is not surprising with the current queue depth 1 implementation. @mmaslankaprv I think I heard you mention that librdkafka only creates a single connection? We should remember this when implementing one-at-a-time processing so that we enforce it at a finer granularity (e.g. per partition, to deal with this sort of single-connection client).
Notice above that the rtt is about 30 milliseconds. Where does that time go? We can start looking around. In this particular example with acks=all there is effectively 1mb of fsync'd writes, split into 4 256kb writes that run in parallel. This is the first part of the trace below:
`segment_appender::flush sem_wait 1011 flush_wait 100 (total 1112)`
This line says that we waited 1ms for the 4 writes to complete and then another 100us for the flush.
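To make those numbers concrete, here is a rough standalone model of that write pattern (a POSIX sketch, not the actual segment_appender): four parallel 256kb writes followed by a single fsync, with the two phases timed separately, roughly corresponding to sem_wait and flush_wait above.

```cpp
// Write 1 MB as four parallel 256 KB pwrites, then fsync, timing each
// phase. Illustrative only; not redpanda's segment_appender.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <future>
#include <unistd.h>
#include <vector>

int main() {
    int fd = ::open("segment.bin", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) return 1;
    constexpr std::size_t chunk = 256 * 1024;
    std::vector<char> buf(chunk, 'x');

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::future<ssize_t>> writes;
    for (int i = 0; i < 4; ++i)  // 4 x 256kb in parallel = 1mb
        writes.emplace_back(std::async(std::launch::async, [&, i] {
            return ::pwrite(fd, buf.data(), chunk, off_t(i) * chunk);
        }));
    for (auto& w : writes) w.get();  // ~sem_wait: all writes complete
    auto t1 = std::chrono::steady_clock::now();
    ::fsync(fd);                     // ~flush_wait: make it durable
    auto t2 = std::chrono::steady_clock::now();

    using us = std::chrono::microseconds;
    std::printf("sem_wait %lld flush_wait %lld (total %lld) us\n",
      (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
      (long long)std::chrono::duration_cast<us>(t2 - t1).count(),
      (long long)std::chrono::duration_cast<us>(t2 - t0).count());
    ::close(fd);
    return 0;
}
```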
Finally,

`commit index update wait time 1125`

is the raft layer waiting on the flush to finish before acking back to the higher levels that issued the replicate. If we look at a summary of a 2 minute run we see:
So at the 99.9th percentile we aren't even close to accounting for the roughly 30ms rtt reported by the performance tool.
Here is the e2e time for processing a request. This measures from the start of reading the header off the input stream to the time that writing the response to the output stream completes.
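In sketch form (with hypothetical stand-ins for the actual handlers), the measurement brackets the whole request like this:

```cpp
// Timestamp when we start reading the request header; stop when the
// response write completes. All three functions are hypothetical.
#include <chrono>
#include <cstdio>

struct request {};
struct response {};
static request read_header() { return {}; }     // request starts here
static response handle(const request&) { return {}; }
static void write_response(const response&) {}  // request ends here

int main() {
    auto start = std::chrono::steady_clock::now();  // header read begins
    request r = read_header();
    response resp = handle(r);
    write_response(resp);  // response fully written to the output stream
    auto end = std::chrono::steady_clock::now();
    std::printf("e2e %lld us\n",
      (long long)std::chrono::duration_cast<std::chrono::microseconds>(
        end - start).count());
    return 0;
}
```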
So at the 99.9th percentile of 5.4ms we still aren't accounting for the reported rtt of 30ms.
Here is a franz-go client, courtesy of @twmb, that shows e2e latency of around 2ms! That's what we expect.
So the next step for single node latency is to figure out what's going on in librdkafka that's leading to the higher reported rtt, and whether that's real or some artifact of the way the timestamps are taken.