After a large number of watch connections are disconnected from a client at the same time, the new watch cannot work properly. #18879

alterge1st · 2024-11-12T02:57:24Z

Bug report criteria

This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
You have read the etcd bug reporting guidelines.
Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.

What happened?

We use an in-process etcd server accessed through v3client. We created a client that opens a new watch on the same resource every second without ever releasing the earlier ones, and let it run for more than 1 minute. Once the client holds a large number of watch connections, we kill the client process. After that, when other clients establish watch connections for the same resource, the new watches no longer receive any events.

What did you expect to happen?

After the client is killed, a new watch connection for the same resource should still receive event changes.
Our analysis shows that a blocking problem exists. Although it is unreasonable for a client to establish a large number of watch connections to the same resource at the same time, can the etcd server do something to avoid the blocking?

How can we reproduce it (as minimally and precisely as possible)?

We created a large number of watch connections to the same configmap resource in a loop, from a single process, using code similar to the following:
main.txt
After running this program for 1 minute, kill it. Then run kubectl get configmap -A -w and modify a configmap: the change is not reported by the watch.

Anything else we need to know?

After the client is killed, a large number of watch connections are disconnected at once. From reading the code, the Send() call for the WatchCancelRequest in the case ws := <-w.closingc branch of the (w *watchGrpcStream) run() method in etcd/client/v3/watch.go blocks and never returns.
We suspect that the large number of WatchCancelRequests fills the channel used by watchGrpcStream, so new WatchResponses can no longer be pushed into sws.ctrlStream. Both the WatchResponses read from ctrlStream and any new WatchResponses are then stuck behind the case pbresp := <-w.respc and case ws := <-w.closingc branches of (w *watchGrpcStream) run().

Etcd version (please run commands below)

$ etcd --version
# 3.5.11
$ etcdctl version
# 3.5.11

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response


alterge1st · 2024-11-12T03:04:50Z

If the size of the ctrlStream channel (ctrlStreamBufLen) is increased, the blocking problem can still be reproduced, but more watch connections have to be created and disconnected.
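
A toy calculation of why a larger buffer only moves the threshold, assuming the consumer side is not draining at all. acceptedBeforeBlock is a hypothetical helper and the buffer sizes are illustrative values, not etcd's settings.

```go
package main

import "fmt"

// acceptedBeforeBlock reports how many cancel requests an undrained channel
// of capacity bufLen accepts before the next send would block.
func acceptedBeforeBlock(bufLen, cancels int) int {
	ctrl := make(chan int, bufLen)
	for i := 0; i < cancels; i++ {
		select {
		case ctrl <- i:
			// accepted into the buffer
		default:
			return i // a plain blocking send would stall here
		}
	}
	return cancels
}

func main() {
	for _, bufLen := range []int{16, 64, 256} {
		fmt.Println("bufLen:", bufLen, "accepted:", acceptedBeforeBlock(bufLen, 1000))
	}
}
```

The number of accepted requests grows with the capacity, but any burst larger than the capacity still reaches the blocking send.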

ahrtr · 2024-11-12T08:10:20Z

We recently fixed a watch related goroutine leak issue, #18784

The fix will be included in 3.5.17, which is supposed to be released this week. Please try again with the new version once available.

alterge1st · 2024-11-12T08:20:07Z

> We recently fixed a watch related goroutine leak issue, #18784
>
> The fix will be included in 3.5.17, which is supposed to be released this week. Please try again with the new version once available.

Okay, thanks, we'll try the new version

alterge1st · 2024-11-12T09:16:32Z

> We recently fixed a watch related goroutine leak issue, #18784
>
> The fix will be included in 3.5.17, which is supposed to be released this week. Please try again with the new version once available.

Unfortunately, after updating my local code to include the latest changes, the problem persists.

serathius · 2024-11-12T12:43:33Z

I'm not following whether the issue is etcd or K8s related. The repro you provide discusses code for the Kubernetes API, while the following debugging is about etcd. I would like to clarify this, because the K8s apiserver demultiplexes watch connections to etcd. So the issue with client cancellation should not happen for K8s, as 100 watches opened to the apiserver still open only 1 watch to etcd.
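
The behaviour described here, many client watches sharing one upstream watch, can be pictured with a toy fan-out. This is only a sketch under simplifying assumptions and not how kube-apiserver's watch cache is implemented. The fanout type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// fanout keeps one upstream watch and copies each event to all local watchers.
type fanout struct {
	mu            sync.Mutex
	watchers      map[int]chan string
	nextID        int
	upstreamOpens int // number of upstream watches opened
}

func newFanout() *fanout {
	return &fanout{watchers: make(map[int]chan string)}
}

// watch registers a local watcher. Only the first call opens the upstream watch.
func (f *fanout) watch() (int, <-chan string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.upstreamOpens == 0 {
		f.upstreamOpens = 1
	}
	id := f.nextID
	f.nextID++
	ch := make(chan string, 8)
	f.watchers[id] = ch
	return id, ch
}

// cancel drops a local watcher without sending anything upstream.
func (f *fanout) cancel(id int) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if ch, ok := f.watchers[id]; ok {
		close(ch)
		delete(f.watchers, id)
	}
}

// dispatch copies one upstream event to every remaining local watcher.
func (f *fanout) dispatch(ev string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for _, ch := range f.watchers {
		select {
		case ch <- ev:
		default: // this toy model drops events for a slow watcher
		}
	}
}

func main() {
	f := newFanout()
	ids := make([]int, 0, 100)
	var last <-chan string
	for i := 0; i < 100; i++ {
		id, ch := f.watch()
		ids = append(ids, id)
		last = ch
	}
	for _, id := range ids[:99] {
		f.cancel(id) // mass cancellation stays local
	}
	f.dispatch("configmap modified")
	fmt.Println(f.upstreamOpens, <-last)
}
```

In such a design a burst of client cancellations never turns into a burst of upstream cancel requests, which is why the answer to whether the watch cache is enabled matters.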

alterge1st · 2024-11-13T07:07:40Z

> I'm not following whether the issue is etcd or K8s related. The repro you provide discusses code for the Kubernetes API, while the following debugging is about etcd. I would like to clarify this, because the K8s apiserver demultiplexes watch connections to etcd. So the issue with client cancellation should not happen for K8s, as 100 watches opened to the apiserver still open only 1 watch to etcd.

Kube-apiserver is integrated with etcd: etcd runs as an in-process server and its APIs are invoked directly instead of through etcd service ports. The problem occurs when the K8s watch command is used. When another process is started that repeatedly creates a large number of watch requests for the configmap, killing that process causes the Kubernetes watch on the configmap to stop receiving events.

serathius · 2024-11-13T17:44:01Z

> The in-process server of etcd is used to directly invoke APIs instead of using etcd service ports.

This should not change the fact that the apiserver demultiplexes watches. Or have you disabled the watch cache?

alterge1st · 2024-11-14T08:29:33Z

> > The in-process server of etcd is used to directly invoke APIs instead of using etcd service ports.
>
> This should not change the fact that the apiserver demultiplexes watches. Or have you disabled the watch cache?

Yes, we did disable the watch cache.

alterge1st added the type/bug label Nov 12, 2024

serathius added the priority/important-soon label Nov 15, 2024
