
Performance Issue: CPU usage of the leader node increased by 20% #18069

Closed · 4 tasks done
qixiaoyang0 opened this issue May 27, 2024 · 26 comments · Fixed by #18428

Comments

@qixiaoyang0 commented May 27, 2024

Bug report criteria

What happened?

In the etcd cluster I tested (arm64 CPUs), the lease renew rate is 1,500/s, and the CPU usage of the leader node increased by 20%.

This problem is caused by this modification:

if !s.ensureLeadership() {

We conducted rigorous tests, including checking pprof, collecting metrics, deleting that line of code and recompiling etcd, and confirmed that this was the reason for the CPU increase.
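For context, a hedged sketch of the approximate shape of the check referenced above (the types below are stand-ins so the snippet compiles on its own; this is not the verbatim etcd source). The relevant cost is the linearizable read barrier, which waits for a raft ReadIndex round trip to a quorum of peers on every renewal the leader serves.

```go
package sketch

import (
	"context"
	"time"
)

// server is a stand-in for the parts of the etcd server the check relies on.
type server struct {
	memberID          uint64
	reqTimeout        time.Duration
	readNotify        func(ctx context.Context) error // linearizable read barrier
	currentRaftLeader func() uint64                    // raft status: current leader ID
}

// ensureLeadership mirrors the added check: confirm leadership with a quorum
// via a linearizable read, then verify this member is still the raft leader.
func (s *server) ensureLeadership() bool {
	ctx, cancel := context.WithTimeout(context.Background(), s.reqTimeout)
	defer cancel()

	// One ReadIndex round trip over the peer rafthttp streams per call; at
	// 1,500 renewals/s this is where the extra peer traffic and CPU come from.
	if err := s.readNotify(ctx); err != nil {
		return false
	}
	return s.currentRaftLeader() == s.memberID
}
```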

What did you expect to happen?

It may not be necessary to add leader confirmation to lease renew, or a more performant method could be used.

How can we reproduce it (as minimally and precisely as possible)?

Sending many lease update requests to the etcd cluster; on v3.5.11, CPU usage will increase a lot.
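For reference, a minimal load-generation sketch of this kind of reproduction (the endpoint, lease count, and 3-second TTL are illustrative assumptions, not values taken from this report). Granting roughly 1,500 leases and keeping them alive makes clientv3 renew each lease about every TTL/3 seconds, i.e. roughly 1,500 renewals/s at the server, while leader CPU and the etcd_network_peer_*_bytes_total counters are watched.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()
	const leases = 1500 // ~1,500 leases with a 3s TTL -> ~1,500 renewals/s

	for i := 0; i < leases; i++ {
		lease, err := cli.Grant(ctx, 3) // 3-second TTL
		if err != nil {
			log.Fatal(err)
		}
		ch, err := cli.KeepAlive(ctx, lease.ID)
		if err != nil {
			log.Fatal(err)
		}
		// Drain the keepalive responses so the stream keeps renewing.
		go func() {
			for range ch {
			}
		}()
	}

	// Let the load run while watching leader CPU and peer traffic metrics.
	time.Sleep(10 * time.Minute)
}
```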

Anything else we need to know?

No response

Etcd version (please run commands below)

v3.5.11

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@jmhbnz (Member) commented May 27, 2024

Hi @qixiaoyang0, thanks for highlighting this. For context, this performance impact was mentioned in the PR that introduced the line you refer to; see #16822 (comment).

The PR was required to fix #15247.

I am not sure there is much we can do to alleviate this at the moment; I defer to @ahrtr and @serathius for ideas on an alternative approach (if any).

@ahrtr (Member) commented May 27, 2024

Sending many lease update requests to the etcd cluster; on v3.5.11, CPU usage will increase a lot.

  • How did you send "lease update request"? Did you keep sending KeepAliveOnce?
  • If yes (to the above question), do you have a real use case that requires repeatedly sending KeepAliveOnce? Have you considered calling KeepAlive?

@qixiaoyang0 (Author)

Sending many lease update requests to the etcd cluster; on v3.5.11, CPU usage will increase a lot.

  • How did you send "lease update request"? Did you keep sending KeepAliveOnce?
  • If yes (to the above question), do you have a real use case that requires repeatedly sending KeepAliveOnce? Have you considered calling KeepAlive?

@ahrtr Thanks for the reply. I'm very sorry that my wording was unclear: "lease update request" means lease renew.
Our method of renewing leases is neither KeepAliveOnce nor KeepAlive. Instead, we use a keepalive mechanism built on a gRPC stream; from the etcd server's point of view, it is equivalent to KeepAlive.

@serathius (Member) commented May 27, 2024

This problem is caused by this modification

What was your method of deduction? Did you do any profiling?

@ahrtr (Member) commented May 27, 2024

Instead, we use a keepalive mechanism built on a gRPC stream.

Thanks for the response, but I do not quite understand this. Please feel free to ping me on Slack (the K8s workspace).

@qixiaoyang0 (Author)

This problem is caused by this modification

What was your method of deduction? Did you do any profiling?

My test method was:

  • The CPU growth was not discovered with standard test cases; it was discovered in our production environment. In our cluster, log statistics show that the lease renew rate is 1,500 requests per second. I think it can be reproduced with benchmark tests.

  • Checking the CPU usage through pprof shows that the main growth hot spots are rafthttp.(*streamReader).run and rafthttp.(*streamWriter).run.

  • In the metrics, the growth rates of etcd_network_peer_received_bytes_total and etcd_network_peer_sent_bytes_total are an order of magnitude higher than in versions before v3.5.13.

  • In etcdserver.(*EtcdServer).LeaseRenew, I deleted the ensureLeadership call, recompiled the binary, and deployed it to the cluster. The CPU usage dropped back to its pre-v3.5.13 value.

@tjungblu (Contributor) commented May 27, 2024

In etcdserver.(*EtcdServer).LeaseRenew, I deleted the ensureLeadership call

It makes me wonder why it increases the CPU usage so much; it merely waits on the read state notification.
Edit: with etcd_network_peer_sent_bytes_total an order of magnitude higher, proto parsing? :)

@qixiaoyang0 (Author)

In etcdserver.(*EtcdServer).LeaseRenew, I deleted the ensureLeadership call

It makes me wonder why it increases the CPU usage so much; it merely waits on the read state notification. Edit: with etcd_network_peer_sent_bytes_total an order of magnitude higher, proto parsing? :)

In a cluster where the lease renew request rate is stable, I first collect the metrics once and record the values of etcd_network_peer_sent_bytes_total and etcd_network_peer_received_bytes_total. After 10 minutes, I collect the metrics again and record the new values. From the two samples I can derive the sending and receiving rates; a small sketch of this measurement follows below.
Comparing the version before v3.5.13 with v3.5.13, the sent/received rate changes from 9,523/s to 43,034/s.
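As an illustration of this measurement (the metrics URL below is an assumption; the counter name is the one mentioned above): scrape /metrics twice, ten minutes apart, and divide the counter delta by the elapsed seconds.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrapeCounter sums all samples of a counter family from an etcd /metrics page.
func scrapeCounter(url, name string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var total float64
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, name) {
			continue // skip HELP/TYPE lines and other metrics
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v // the counter is labelled per peer; sum across peers
		}
	}
	return total, sc.Err()
}

func main() {
	const url = "http://127.0.0.1:2379/metrics" // assumed endpoint
	const counter = "etcd_network_peer_sent_bytes_total"

	before, err := scrapeCounter(url, counter)
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(10 * time.Minute)
	after, err := scrapeCounter(url, counter)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s rate: %.0f/s\n", counter, (after-before)/600)
}
```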

@ahrtr (Member) commented May 27, 2024

Usually users don't need to renew a lease too frequently. KeepAlive sends a LeaseKeepAliveRequest every TTL/3 seconds; can you follow a similar pattern? If not, can you please explain why?

In order to avoid long back-and-forth communication, please feel free to ping me this week.
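For illustration, a hedged sketch of the TTL/3 cadence described above for a client that drives renewals itself rather than using clientv3's KeepAlive loop; the package and function names are made up for this example.

```go
package leaseutil

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// renewAtThirdOfTTL renews a single lease roughly every TTL/3, the same cadence
// clientv3's KeepAlive uses, so the server-side renew rate stays proportional
// to the TTL instead of growing with an arbitrary client-side loop.
func renewAtThirdOfTTL(ctx context.Context, cli *clientv3.Client, id clientv3.LeaseID, ttl time.Duration) error {
	ticker := time.NewTicker(ttl / 3)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// One LeaseKeepAliveRequest per tick; treat an error as the lease
			// being lost (e.g. it already expired or the endpoint is unreachable).
			if _, err := cli.KeepAliveOnce(ctx, id); err != nil {
				return err
			}
		}
	}
}
```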

@qixiaoyang0 (Author)

Usually users don't need to renew a lease too frequently. KeepAlive sends a LeaseKeepAliveRequest every TTL/3 seconds; can you follow a similar pattern? If not, can you please explain why?

In order to avoid long back-and-forth communication, please feel free to ping me this week.

Yes, the client does not need to send renew requests to the server too frequently, but there are more than 1,000 services in our system, and most of the lease TTLs are set to 3 or 4 seconds, so there are about 1,500 renew requests per second on the server.

Thank you for your attention, but I can't find your email address or any other way to message you. How can I contact you?

@ahrtr (Member) commented May 28, 2024

Thank you for your attention, but I can't find your email address or any other way to message you. How can I contact you?

Slack, as mentioned above, or email.

@vivekpatani (Contributor)

@ahrtr I'd like to help in figuring out this issue, let me know if and how I can help. Thanks.

CC: @qixiaoyang0

@ahrtr (Member) commented Jun 1, 2024

@ahrtr I'd like to help in figuring out this issue, let me know if and how I can help. Thanks.

CC: @qixiaoyang0

Thanks @vivekpatani. It'd be better to get the following clarified before we close the ticket.

  • The reduction (around 20%) in lease renew QPS is expected, as mentioned in the comment, but the impact on CPU usage wasn't evaluated in the first place. It seems a bit counterintuitive, because #16822 just added some network I/O.

  • We mainly need to understand the use case.

    • Usually clients do not need to renew a lease too frequently.
    • It isn't recommended to set too small a TTL (e.g. < 5s) for a lease; otherwise it may be affected by either client-side or server-side delay.
  • As mentioned in the comment, the change in #16822 has no impact on K8s, and lease renew performance isn't that critical, so most likely we won't change anything for this ticket. But I'm still happy to get the above items clarified before we close this ticket.

@vivekpatani (Contributor)

ack, will post results in st. @ahrtr

@ranandfigma

Hey! We are running into a similar issue, although it's manifesting in a slightly different way, with step-function increases in peer network traffic over time (see image).
[Screenshot: step-function increases in peer network traffic over time]

This continuous increase eventually causes the network buffer to saturate and an etcd node to crash. We have a similarly high rate of lease renewals, at 1.7K renewals/second, in our production environment. We're just using the KeepAlive call to renew leases with a TTL of 10 seconds (although from the Node.js client).

We also saw similar behavior at a much smaller scale in our staging environment and discovered that the issue started only with 3.5.13, which led us to believe it's the same issue as this one. As this issue reports, we also see an increase in CPU usage, but our cluster had a baseline of 2% CPU usage prior to the upgrade, so the increased CPU isn't a problem for us.

Is this sort of step-function increase in peer network traffic expected from this change? We attempted to reproduce this locally but have so far been unable to get a repro.

@ahrtr (Member) commented Aug 8, 2024

causes the network buffer to saturate and an etcd node to crash

crash due to what? OOM?

lease renewals at 1.7K renewals / second in our production environment

Your lease TTL is 10s, so KeepAlive sends a renewal every TTL/3 ≈ 3.3s, so you have roughly 5.6K leases?

Is this sort of step function increase in peer network traffic expected from this change?

I am not 100% sure about this. But we are aware that the change in #17425 does cause some performance reduction (e.g. around a 20% reduction in lease renewal QPS), as mentioned in #16822 (comment). It's a side effect of resolving the issue #15247 (comment).

Anyone, please feel free to let me know if you have a better solution that resolves the issue without the performance penalty.

@caesarxuchao

Hi, @ahrtr, thanks for taking a look! I'm @ranandfigma's coworker.

We figured out the cause of the step-function increase in peer network traffic. It's because our 5-instance etcd cluster was behind an AWS NLB. After we upgraded the etcd cluster in a rolling fashion, the lease renewal traffic was not distributed evenly, with the leader etcd taking only the second-least traffic. Note that ensureLeadership() is only called when the renewal requests are handled by the leader etcd, so at first the peer traffic did not increase by much. Then every time we redeployed our clients, the lease renewal traffic was rebalanced. With the leader taking a larger share of renewal requests, the peer traffic increased. The peer traffic plateaued once the traffic was distributed evenly among the etcd instances.

The actual damage occurred after we accidentally lost one follower (I haven't figured out the cause yet): the leader started to take more renewal requests, and the peer traffic sent/received by the leader increased to around 100 KB/s, which crossed some network threshold. We saw "dropped internal Raft message since sending buffer is full (overloaded network)" with "message-type=MsgHeartbeat" from the leader, and the lease renewals started to fail. The expired leases caused major churn in our service. For the record, our system had 5,700 leases with TTL=10s and renewals sent every TTL/3, so the total renew request QPS was 1,710, shared equally by the 4 remaining etcd instances, so the leader was handling 427.5 QPS.

After we downgraded the etcd cluster to 3.5.12, which doesn't have this ensureLeadership(), the peer network traffic sent/received by the leader was about 2 KB/s.

Here's my question. In #16822 (comment), you mentioned that, according to the benchmark, etcd can still process "50344 LeaseKeepAliveRequests per second", which is much higher than the 427.5 QPS at which our etcd leader started dropping renewal requests. I'm curious what the performance bottleneck was in the benchmark setup; was it also limited by the internal Raft heartbeat messages? Also, please let us know if there is a way to work around this lease renewal bottleneck.

P.S. Just out of curiosity, should the HTTP lease renew handler also call the ensureLeadership() function? After reading #15247, I think it's not necessary, because if the old leader were stuck, the followers would elect a new leader, so they wouldn't mistakenly forward the renew requests to the old leader.

@tjungblu (Contributor) commented Aug 9, 2024

Do you mind sharing a bit more about your setup? If you're running on AWS, what instance type? And you're running a five-node etcd cluster?

@ahrtr (Member) commented Aug 9, 2024

peer traffic sent/received by the leader increased to around 100 KB/s

After we downgraded the etcd cluster to 3.5.12, which doesn't have this ensureLeadership(), the peer network traffic sent/received by the leader was about 2 KB/s.

Sorry, I didn't realize that #16822 had such a big impact on performance, especially at a large scale. #18428 should be able to resolve this issue.

etcd can still process “50344 LeaseKeepAliveRequests per second”, which was much higher than the 427.5 QPS we saw

It doesn't make sense to compare the performance numbers, because the data was collected in different environments; I tested in my local environment.

P.S. Just out of curiosity, should the HTTP lease renew handler also call the ensureLeadership() function?

No need. There's a (very low) possibility that a lease renew request is sent to the wrong leader, but missing one renew request isn't a big problem, because every lease renew request goes through EtcdServer.LeaseRenew first.

@caesarxuchao

Do you mind sharing a bit more about your setup? If you're running on AWS, what instance type? And you're running a five-node etcd cluster?

We run the 5-node etcd cluster on five m5d.4xlarge instances.

@jmhbnz (Member) commented Aug 14, 2024

Re-opening for backports and the CHANGELOG.

@jmhbnz jmhbnz reopened this Aug 14, 2024
@ahrtr (Member) commented Aug 14, 2024

Thanks @jmhbnz

All done, closing...

@ahrtr ahrtr closed this as completed Aug 14, 2024
@ahrtr (Member) commented Aug 14, 2024

The fix will be included in 3.5.16 and 3.4.34. Refer to #18441.

@serathius serathius reopened this Aug 15, 2024
@serathius (Member)

#18428 (comment)

@serathius (Member)

With #18450 I think we can close this.

Still, I would prefer to use a different signal to detect a member being stuck. My main concern is with creating a new signal (last probe < 3 ticks) that is neither well tested nor exposed in any other way. My suggestion would be to detect a stuck member with an approach similar to the one discussed for liveness probes. It could then be exposed as a metric and integrated into the probe, making it more transparent.
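To make that concrete, a purely illustrative sketch (hypothetical package, type, and metric names; not the signal #18428 or #18450 actually implement): track when each follower last responded to the leader, expose the staleness as a metric, and treat the leader as healthy when a quorum responded within a bounded number of heartbeat ticks, so the same signal could back both the lease-renew check and a liveness probe.

```go
package stuckdetect

import (
	"fmt"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// peerStaleness exposes how long ago each peer last responded to the leader.
// The metric name is hypothetical, chosen only for this sketch.
var peerStaleness = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "example_peer_last_response_seconds",
	Help: "Seconds since a peer last responded to the leader.",
}, []string{"peer"})

func init() { prometheus.MustRegister(peerStaleness) }

// FollowerHealth records the last time each follower acknowledged the leader.
type FollowerHealth struct {
	mu           sync.Mutex
	lastResponse map[uint64]time.Time
	tick         time.Duration // raft heartbeat tick interval
}

func NewFollowerHealth(tick time.Duration) *FollowerHealth {
	return &FollowerHealth{lastResponse: make(map[uint64]time.Time), tick: tick}
}

// RecordResponse is called whenever a heartbeat or append response arrives.
func (f *FollowerHealth) RecordResponse(peerID uint64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.lastResponse[peerID] = time.Now()
}

// QuorumActive reports whether a quorum (including the leader itself) has been
// heard from within maxTicks heartbeat ticks, updating the staleness metric.
func (f *FollowerHealth) QuorumActive(clusterSize, maxTicks int) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	active := 1 // count the leader itself
	now := time.Now()
	for id, last := range f.lastResponse {
		staleness := now.Sub(last)
		peerStaleness.WithLabelValues(fmt.Sprint(id)).Set(staleness.Seconds())
		if staleness <= time.Duration(maxTicks)*f.tick {
			active++
		}
	}
	return active > clusterSize/2
}
```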
