Performance Issue: CPU usage of the leader node increased by 20% #18069
Comments
Hi @qixiaoyang0 - Thanks for highlighting this. Adding some context: this performance impact was mentioned as part of the PR that introduced the line you mention. Refer to #16822 (comment). That PR was required to fix #15247. I am not sure there is much we can do to alleviate this currently; I'll defer to @ahrtr and @serathius on any ideas for an alternative approach (if any).
@ahrtr Thanks for the reply. I'm very sorry that my wording was unclear; by "lease update request" I mean lease renew.
What was your method of deduction? Did you do any profiling?
Thanks for the response, but I do not quite understand this. Please feel free to ping me on Slack (K8s workspace).
My test method is:
It makes me wonder why it increases the CPU usage so much; it merely waits on the read-state notification.
In a cluster where the lease renew request rate is stable, I first collect metrics once and record the values of the relevant counters.
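For anyone who wants to repeat this kind of before/after metrics comparison, here is a minimal sketch in Go. The endpoint and the 60-second window are placeholders, and `process_cpu_seconds_total` is just one example counter (exposed by etcd's Prometheus process collector); the exact counters the reporter recorded are not included in the issue.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// scrape fetches the value of a single counter from an etcd member's /metrics endpoint.
func scrape(url, metric string) (float64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Skip HELP/TYPE lines; match "metric_name value".
		if strings.HasPrefix(line, metric+" ") {
			fields := strings.Fields(line)
			return strconv.ParseFloat(fields[len(fields)-1], 64)
		}
	}
	return 0, fmt.Errorf("metric %q not found", metric)
}

func main() {
	const url = "http://127.0.0.1:2379/metrics" // assumed client URL of the leader
	const metric = "process_cpu_seconds_total"  // CPU counter from the Prometheus process collector
	const window = 60 * time.Second

	before, err := scrape(url, metric)
	if err != nil {
		panic(err)
	}
	time.Sleep(window) // measurement window while the renew rate stays stable
	after, err := scrape(url, metric)
	if err != nil {
		panic(err)
	}
	// Average number of cores consumed over the window; compare runs on
	// 3.5.12 vs 3.5.13+ at the same renew rate.
	fmt.Printf("avg CPU usage over %v: %.2f cores\n", window, (after-before)/window.Seconds())
}
```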
Usually users don't need to renew a lease too frequently; the client's KeepAlive sends a renew roughly every third of the TTL. In order to avoid long back-and-forth communication, please feel free to ping me this week.
Yes, the client does not need to send renew requests to the server too frequently, but there are more than 1,000 services in our system, and most of the lease TTLs are set to 3 or 4 seconds, so there are about 1,500 renew requests per second on the server. Thank you for your attention, but I can't find your email address or any other way to message you. How can I contact you?
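To make that arithmetic concrete: with clientv3, every lease kept alive via KeepAlive generates its own stream of LeaseKeepAlive requests, roughly one every third of the TTL, so 1,000+ services holding leases with 3-4 s TTLs add up to the ~1,500 renews/s described here. A minimal client-side sketch, assuming a reachable endpoint at 127.0.0.1:2379:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // assumed endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Grant a short-TTL lease, as the services described above do (TTL 3-4 s).
	lease, err := cli.Grant(context.Background(), 3)
	if err != nil {
		log.Fatal(err)
	}

	// KeepAlive renews the lease in the background for as long as the returned
	// channel is drained; with a 3 s TTL that is roughly one LeaseKeepAlive
	// request per second for this single lease.
	ch, err := cli.KeepAlive(context.Background(), lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ch {
		log.Printf("renewed lease %x, TTL=%d", resp.ID, resp.TTL)
	}
}
```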
@ahrtr I'd like to help figure out this issue; let me know if and how I can help. Thanks. CC: @qixiaoyang0
Thanks, @vivekpatani. It'd be better to get the following clarified before we close the ticket.
Ack, will post results in st. @ahrtr
Hey! We are running into a similar issue, although it manifests in a slightly different way, with step-function increases in peer network traffic over time (see image). This continuous increase eventually causes the network buffer to saturate and an etcd node to crash. We have a similarly high rate of lease renewals, about 1.7K renewals/second, in our production environment. We're just using the …

We also saw similar behavior at a much smaller scale in our staging environment and discovered that the issue started only from 3.5.13, which led us to believe that it's the same issue as this one. As this issue reports, we also see an increase in CPU usage, but our cluster had a baseline of 2% CPU usage prior to the upgrade, so the increased CPU isn't a problem for us.

Is this sort of step-function increase in peer network traffic expected from this change? We attempted to reproduce this locally but have so far been unable to get a repro.
Crash due to what? OOM?
Your leases' TTL is 10 s, so each lease should only need a renew roughly every TTL/3.
I am not 100% sure about this, but we are aware that the change in #17425 does cause some performance reduction (e.g., around a 20% drop in QPS for lease renewal requests), as mentioned in #16822 (comment). It's a side effect of resolving issue #15247 (comment). Please feel free to let me know if you have a better solution that resolves the issue without the performance penalty.
Hi @ahrtr, thanks for taking a look! I'm @ranandfigma's coworker. We figured out the cause of the step-function increase in peer network traffic. It's because our 5-instance etcd cluster was behind an AWS NLB. After we upgraded the etcd cluster in a rolling fashion, the lease renewal traffic was not distributed evenly, with the leader taking only the second-least traffic. Note that the …

The actual damage occurred after we accidentally lost one follower (I haven't figured out the cause yet): the leader started to take more renewal requests, the peer traffic sent/received by the leader increased to around 100 KB/s, and that crossed some network threshold. We saw "dropped internal Raft message since sending buffer is full (overloaded network)" with "message-type=MsgHeartbeat" from the leader, and the lease renewals started to fail. The expired leases caused a major churn in our service.

For the record, our system had 5,700 leases with TTL=10 s and the renewal sent every 1/3*TTL, so the total renew request QPS was 1,710, shared equally by the 4 remaining etcd instances, so the leader was handling 427.5 QPS. After we downgraded the etcd cluster to 3.5.12, which didn't have this …

Here's my question. In #16822 (comment), you mentioned that according to the benchmark, etcd can still process "50344 …

P.S. Just out of curiosity, should the HTTP lease renew handler also call the ensureLeadership() function? After reading #15247, I think it's not necessary, because if the old leader were stuck, the followers would elect a new leader, so they won't mistakenly forward the renew requests to the old leader.
Do you mind sharing a bit more about your setup? If you're running on AWS, what instance type? And you're also running a five-node etcd cluster?
Sorry, I didn't realize that #16822 had such a big impact on performance, especially at a large scale. #18428 should be able to resolve this issue.
It doesn't make sense to compare the performance numbers, because the data came from different environments; I tested in my local environment.
No need. It's possible (though very unlikely) that a lease renew may be sent to a wrong leader, but missing one renew request isn't a big problem, because every lease renew request gets to …
We run the 5-member etcd cluster on five m5d.4xlarge instances.
Re-open for backports and CHANGELOG |
Thanks @jmhbnz. All done, closing...
The fix will be included in 3.5.16 and 3.4.34. Refer to #18441.
Please read my comment.
Let me know if you still have any comments before I close this ticket.
With #18450 I think we can close this. Still, I would prefer to use a different signal to detect a member being stuck. My main concern is with creating a new signal (last probe < 3 ticks) that is not well tested nor exposed in any other way. My suggestion would be to detect a stuck member with an approach similar to the one discussed for liveness probes. It could then be exposed as a metric and integrated into the probe, making it more transparent.
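This is not what #18450 actually implements; just to illustrate the kind of signal being discussed, here is a rough sketch of a quorum-activity check built on raft's Status() progress tracking (import path go.etcd.io/raft/v3 in newer releases, go.etcd.io/etcd/raft/v3 on the 3.5 branch). The function names and the idea of exposing the result as a gauge are illustrative only, not etcd's actual code.

```go
// Package quorumcheck sketches one possible "member is stuck" signal: does the
// leader still see a quorum of recently active followers?
package quorumcheck

import (
	"go.etcd.io/raft/v3" // go.etcd.io/etcd/raft/v3 on the 3.5 branch
)

// activeFollowers counts followers the leader has heard from recently, based on
// raft's per-follower progress tracking. RecentActive is reset by raft roughly
// once per election timeout, so this is a coarser signal than the
// "last probe < 3 ticks" heuristic mentioned above.
func activeFollowers(st raft.Status) int {
	n := 0
	for id, pr := range st.Progress {
		if id == st.ID {
			continue // skip the leader itself
		}
		if pr.RecentActive {
			n++
		}
	}
	return n
}

// quorumActive reports whether the leader plus its recently active followers
// still form a quorum. Exposed as a gauge, a signal of this shape could feed
// both a liveness probe and an ensureLeadership-style check.
func quorumActive(st raft.Status) bool {
	total := len(st.Progress)
	if total == 0 {
		return false // Progress is only populated on the leader
	}
	return activeFollowers(st)+1 > total/2
}
```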
Bug report criteria
What happened?
In the etcd cluster I tested (arm64 CPUs), the lease renew rate is 1,500/s, and the CPU usage of the leader node increased by 20%.
This problem is caused by this modification: etcd/server/etcdserver/v3_server.go, line 288 at commit bb701b9.
We conducted rigorous tests, including checking pprof, collecting metrics, deleting that line of code and recompiling etcd, and confirmed that this was the reason for the CPU increase.
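For readers without the source open, the cited line adds a leadership confirmation to the lease renew path: before renewing, the server waits on a linearizable read-state notification and then checks that it is still the leader. The sketch below only illustrates that shape; the interface and function names are made up for the example and do not match etcd's internal types.

```go
package sketch

import (
	"context"
	"errors"
)

var errNotLeader = errors.New("not the leader")

// The narrow interfaces below are illustrative only; they capture the shape of
// the check rather than etcd's real types.
type raftReader interface {
	// ReadNotify blocks until a linearizable read state is confirmed
	// (the read-state notification mentioned in the discussion above).
	ReadNotify(ctx context.Context) error
	IsLeader() bool
}

type lessor interface {
	Renew(id int64) (ttl int64, err error)
}

// renew mirrors the post-change lease renew path: confirm leadership via a
// confirmed read state before touching the lease. The extra ReadNotify wait on
// every renew is the suspected source of the ~20% CPU increase at 1,500 renews/s.
func renew(ctx context.Context, r raftReader, l lessor, id int64) (int64, error) {
	if err := r.ReadNotify(ctx); err != nil {
		return -1, err
	}
	if !r.IsLeader() {
		return -1, errNotLeader
	}
	return l.Renew(id)
}
```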
What did you expect to happen?
It may not be necessary to add leader confirmation to the lease renew path.
Or, if it is, a more performant method could be used.
How can we reproduce it (as minimally and precisely as possible)?
Send many lease renew requests to the etcd cluster; on v3.5.11 the CPU usage will increase a lot.
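To generate that kind of load without standing up real services, here is a rough load-generator sketch using clientv3: it grants a batch of short-TTL leases and renews each one on a fixed interval with KeepAliveOnce, so the aggregate renew rate is roughly numLeases per interval. The endpoint, lease count, TTL, and interval are placeholders to adjust for your environment.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	const (
		numLeases = 500                     // 500 leases renewed every second ~= 500 renews/s from this client
		ttl       = 4                       // seconds, matching the short TTLs described in the issue
		interval  = 1 * time.Second         // renew interval per lease
		endpoint  = "http://127.0.0.1:2379" // placeholder endpoint
	)

	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{endpoint}, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	for i := 0; i < numLeases; i++ {
		lease, err := cli.Grant(context.Background(), ttl)
		if err != nil {
			log.Fatal(err)
		}
		// Renew this lease explicitly on a fixed cadence so the aggregate
		// LeaseKeepAlive request rate on the server is predictable.
		go func(id clientv3.LeaseID) {
			t := time.NewTicker(interval)
			defer t.Stop()
			for range t.C {
				if _, err := cli.KeepAliveOnce(context.Background(), id); err != nil {
					log.Printf("renew %x failed: %v", id, err)
				}
			}
		}(lease.ID)
	}

	select {} // run until interrupted; watch the leader's CPU while this runs
}
```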
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.5.11
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output
No response