fix(kds): detect properly hanging stream for the same zone #12983
base: master
Conversation
Signed-off-by: Lukasz Dziedziak <[email protected]>
Should we implement a feature flag to disable it? Please vote.
The PR makes sense. I'm a bit worried it's going to be a mess when, on each connection, at least 6 concurrent processes try to update the same ZoneInsight resource. If at least one of them fails, the others are canceled (and retried on the client).
As this could potentially be a regression, I'd hide it behind a feature flag and disable it by default.
// In this case, all instances will try to update Zone Insight which will result in conflicts.
// Since it's unusual to immediately execute envoy admin rpcs after zone is connected, 0-10s delay should be fine.
// #nosec G404 - math rand is enough
time.Sleep(time.Duration(rand.Int31n(10000)) * time.Millisecond)
Isn't the retry below just fine?
The retry is fine, but it doesn't solve the issue with the first request; the random delay only improves the probability that the first request won't hit a conflict.
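For readers following this exchange, here is a rough, hypothetical sketch of how the two mechanisms complement each other. The update and isConflict callbacks stand in for the real resource-manager calls, and the retry count and backoff are illustrative: the jitter spreads out each instance's first write so it is less likely to conflict at all, while the retry loop handles the conflicts that still occur.

import (
	"context"
	"math/rand"
	"time"
)

// Hypothetical helper, not the PR's actual code: jitter the first attempt,
// then retry a bounded number of times on optimistic-concurrency conflicts.
func upsertWithJitterAndRetry(ctx context.Context, update func(context.Context) error, isConflict func(error) bool) error {
	// Jitter lowers the chance that several instances issue their very first
	// update at the same moment.
	time.Sleep(time.Duration(rand.Int31n(10000)) * time.Millisecond)

	backoff := 100 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = update(ctx); err == nil || !isConflict(err) {
			return err // success, or an error that is not a conflict
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // back off a bit longer before the next attempt
		}
	}
	return err
}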
Signed-off-by: Lukasz Dziedziak <[email protected]>
Motivation
In a multi-zone architecture, we run a global control plane and multiple zone control planes. Each zone can have multiple instances running, but only the leader is responsible for communicating with the global control plane.
Currently, if there is an issue with a zone, or a proxied stream does not properly detect that the connection was closed, a zone control plane restart can go unnoticed. This can result in multiple open streams from a single zone to the global control plane. Since the health check only verifies the zone name and tenant ID, a newly connected instance of the control plane makes all previous health check detectors pass.
This issue occurs when a gRPC client and server, both configured with Keepalive, communicate through a Gateway that is also configured with the gRPC protocol. We can reproduce it with the following setup:
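The exact reproduction setup (including the gateway configuration) is not shown here; the sketch below only illustrates the kind of configuration involved, assuming grpc-go keepalive on both the zone and global sides, an insecure local connection, and timeouts chosen for illustration rather than taken from Kuma's defaults. The gateway sits between the two and must also be configured for gRPC.

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

// Zone side (gRPC client): ping periodically and expect an ACK. When the ACK
// does not arrive within Timeout, the client closes its side of the transport
// with "keepalive ping failed to receive ACK within timeout".
func dialGlobal(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second, // illustrative values, not Kuma's defaults
			Timeout:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}

// Global side (gRPC server): send its own pings and permit the client's.
func newGlobalServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    10 * time.Second,
			Timeout: 5 * time.Second,
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}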
By introducing packet loss on the zone side, we can observe the following log message in the control plane:
keepalive ping failed to receive ACK within timeout
However, despite this failure, the connection remains open in the control plane, leading to unnecessary resource allocation. This suggests that the Gateway might be misconfigured. In such cases, we should implement safeguards to prevent hanging connections.
Implementation information
We track the last connection time by recording it in an event and checking it when the same client attempts to reconnect; if a reconnection occurs, we cancel the old stream.
To handle scenarios where multiple instances of the global control plane are running, we store this information in ZoneInsight. Additionally, a periodic verification mechanism (zone watcher) ensures the recorded time remains accurate.
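As a rough illustration of that flow, using hypothetical types and names rather than the PR's actual code: each global control plane instance keeps the streams it owns keyed by zone and tenant, cancels the older stream when a newer connection appears, and lets the periodic watcher cancel any local stream whose connection time is older than the one recorded in ZoneInsight.

import (
	"context"
	"sync"
	"time"
)

// Hypothetical types, not the PR's actual code.
type zoneKey struct {
	Zone   string
	Tenant string
}

type streamEntry struct {
	connectedAt time.Time
	cancel      context.CancelFunc
}

type hangingStreamDetector struct {
	mu      sync.Mutex
	streams map[zoneKey]streamEntry
}

func newHangingStreamDetector() *hangingStreamDetector {
	return &hangingStreamDetector{streams: map[zoneKey]streamEntry{}}
}

// OnStreamOpen records a new stream for the zone and cancels any older stream
// held by this instance for the same zone: the zone has reconnected, so the
// previous stream is treated as hanging.
func (d *hangingStreamDetector) OnStreamOpen(key zoneKey, connectedAt time.Time, cancel context.CancelFunc) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if old, ok := d.streams[key]; ok && old.connectedAt.Before(connectedAt) {
		old.cancel()
	}
	d.streams[key] = streamEntry{connectedAt: connectedAt, cancel: cancel}
}

// Check is what a periodic zone watcher could call: if the connection time
// recorded in ZoneInsight (written by whichever instance owns the live stream)
// is newer than the locally held one, the local stream is stale and is cancelled.
func (d *hangingStreamDetector) Check(key zoneKey, lastConnectedInInsight time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if cur, ok := d.streams[key]; ok && cur.connectedAt.Before(lastConnectedInInsight) {
		cur.cancel()
		delete(d.streams, key)
	}
}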