
fix(kds): detect properly hanging stream for the same zone #12983

Open
wants to merge 6 commits into master
Conversation

lukidzi
Contributor

@lukidzi lukidzi commented Mar 3, 2025

Motivation

In a multi-zone architecture, we run a global control plane and multiple zone control planes. Each zone can have multiple instances running, but only the leader is responsible for communicating with the global control plane.

Currently, if there is an issue with a zone or a proxied stream that does not properly detect the connection closure, we might end up in a situation where a zone control plane restart goes unnoticed. This could result in multiple open streams from a single zone to the global control plane. Since the health check only verifies the zone name and tenant ID, a newly connected instance of the control plane makes all previous health check detectors pass.

This issue occurs when a gRPC client and server, both configured with Keepalive, communicate through a Gateway that is also configured with the gRPC protocol. We can reproduce it with the following setup:

Zone → Gateway (gRPC) → Global

By introducing packet loss on the Zone, we can observe the following log messages in the control plane:

keepalive ping failed to receive ACK within timeout
However, despite this failure, the connection remains open in the control plane, leading to unnecessary resource allocation. This suggests that the Gateway might be misconfigured. In such cases, we should implement safeguards to prevent hanging connections.
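The keepalive behavior described above is configured on both ends of the gRPC connection. As a rough illustration only (this is not the PR's code; the address and durations are made up), a Go client and server pair with keepalive enabled might look like:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client side (zone CP): ping every 10s; if no ACK arrives within 5s the
	// client transport is torn down, producing logs like
	// "keepalive ping failed to receive ACK within timeout".
	conn, err := grpc.Dial("global-cp.example:5685", // hypothetical address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second,
			Timeout:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	_, _ = conn, err

	// Server side (global CP): enforce its own keepalive so the server also
	// drops dead connections rather than relying on the client alone.
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    15 * time.Second,
			Timeout: 5 * time.Second,
		}),
	)
	_ = srv
}
```

Note that when a gateway between zone and global terminates gRPC itself, each hop runs its own keepalive, so the global CP's side of the connection can still look healthy even after the zone is gone, which matches the behavior observed here.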

Implementation information

We could track the last connection time by recording it in an event and later checking if the same client attempts to reconnect. If a reconnection occurs, we cancel the old stream.

To handle scenarios where multiple instances of the global control plane are running, we will store this information in zoneInsight. Additionally, we will use a periodic verification mechanism (zone watcher) to ensure the recorded time remains accurate.

@lukidzi lukidzi changed the title fix(kds): kill stream for the same zone fix(kds): detect properly hanging stream for the same zone Mar 3, 2025
Contributor

github-actions bot commented Mar 3, 2025

Reviewer Checklist

🔍 Each of these sections needs to be checked by the reviewer of the PR 🔍:
If something doesn't apply, please check the box and add a justification if the reason is not obvious.

  • Is the PR title satisfactory? Is this part of a larger feature, and should it be grouped using > Changelog?
  • The PR description is clear and complete. It links to the relevant issue as well as docs and UI issues.
  • This will not break child repos: it doesn't hardcode values (e.g. "kumahq" as an image registry).
  • IPv6 is taken into account (e.g. no string concatenation of host and port).
  • Tests (unit tests, E2E tests, manual tests on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests.
  • Does this contain a change that users need to be notified about? In that case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (This GH action will add the "backport" label based on these file globs; if you want to prevent it from adding the "backport" label, use the no-backport-autolabel label.)

Signed-off-by: Lukasz Dziedziak <[email protected]>
@lukidzi lukidzi added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label Mar 4, 2025
lukidzi added 3 commits March 4, 2025 16:03
@lukidzi
Contributor Author

lukidzi commented Mar 4, 2025

Should we implement a feature flag to disable it? Please vote

@lukidzi lukidzi marked this pull request as ready for review March 4, 2025 22:10
@lukidzi lukidzi requested a review from a team as a code owner March 4, 2025 22:10
Contributor

@lobkovilya lobkovilya left a comment


The PR makes sense. I'm a bit worried that it might get messy when, on each connection, at least 6 concurrent processes try to update the same ZoneInsight resource. If at least one of them fails, the others are going to be canceled (and retried on the client).

As this could potentially cause a regression, I'd hide it behind a feature flag and disable it by default.

// In this case, all instances will try to update Zone Insight which will result in conflicts.
// Since it's unusual to immediately execute envoy admin rpcs after zone is connected, 0-10s delay should be fine.
// #nosec G404 - math rand is enough
time.Sleep(time.Duration(rand.Int31n(10000)) * time.Millisecond)
Contributor


Isn't the retry below just fine?

Contributor Author


The retry is fine, but it doesn't solve the issue for the first request; the random delay only improves the probability that the first request won't hit a conflict.

Signed-off-by: Lukasz Dziedziak <[email protected]>
@lukidzi lukidzi requested review from lahabana and lobkovilya March 7, 2025 12:32