
fix(kds): detect properly hanging stream for the same zone #12983

Open
wants to merge 6 commits into master
Conversation

lukidzi
Contributor

@lukidzi lukidzi commented Mar 3, 2025

Motivation

In a multi-zone architecture, we run a global control plane and multiple zone control planes. Each zone can have multiple instances running, but only the leader is responsible for communicating with the global control plane.

Currently, if there is an issue with a zone or a proxied stream that does not properly detect the connection closure, we might end up in a situation where a zone control plane restart goes unnoticed. This could result in multiple open streams from a single zone to the global control plane. Since the health check only verifies the zone name and tenant ID, a newly connected instance of the control plane makes all previous health check detectors pass.

This issue occurs when a gRPC client and server, both configured with Keepalive, communicate through a Gateway that is also configured with the gRPC protocol. We can reproduce it with the following setup:

Zone → Gateway (gRPC) → Global

By introducing packet loss on the Zone, we can observe the following log messages in the control plane:

keepalive ping failed to receive ACK within timeout
However, despite this failure, the connection remains open in the control plane, leading to unnecessary resource allocation. This suggests that the Gateway might be misconfigured. In such cases, we should implement safeguards to prevent hanging connections.
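The keepalive behavior described above is configured on both ends of the gRPC connection. As a rough illustration only (this is not the PR's code; the address and durations are made up), a Go client and server pair with keepalive enabled might look like:

```go
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Client side (zone CP): ping every 10s; if no ACK arrives within 5s the
	// client transport is torn down, producing logs like
	// "keepalive ping failed to receive ACK within timeout".
	conn, err := grpc.Dial("global-cp.example:5685", // hypothetical address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:                10 * time.Second,
			Timeout:             5 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	_, _ = conn, err

	// Server side (global CP): enforce its own keepalive so the server also
	// drops dead connections rather than relying on the client alone.
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    15 * time.Second,
			Timeout: 5 * time.Second,
		}),
	)
	_ = srv
}
```

Note that when a gateway between zone and global terminates gRPC itself, each hop runs its own keepalive, so the global CP's side of the connection can still look healthy even after the zone is gone, which matches the behavior observed here.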

Implementation information

We could track the last connection time by recording it in an event and later checking if the same client attempts to reconnect. If a reconnection occurs, we cancel the old stream.

To handle scenarios where multiple instances of the global control plane are running, we will store this information in zoneInsight. Additionally, we will use a periodic verification mechanism (zone watcher) to ensure the recorded time remains accurate.

@lukidzi lukidzi changed the title fix(kds): kill stream for the same zone fix(kds): detect properly hanging stream for the same zone Mar 3, 2025
Contributor

github-actions bot commented Mar 3, 2025

Reviewer Checklist

🔍 Each of these sections needs to be checked by the reviewer of the PR 🔍:
If something doesn't apply, please check the box and add a justification if the reason is not obvious.

  • Is the PR title satisfactory? Is this part of a larger feature, and should it be grouped using > Changelog?
  • The PR description is clear and complete. It links to the relevant issue as well as docs and UI issues.
  • This will not break child repos: it doesn't hardcode values (e.g. "kumahq" as an image registry).
  • IPv6 is taken into account (e.g. no string concatenation of host and port).
  • Tests (unit tests, E2E tests, manual tests on universal and k8s)
    • Don't forget ci/ labels to run additional/fewer tests.
  • Does this contain a change that users need to be notified about? In that case, UPGRADE.md should be updated.
  • Does it need to be backported according to the backporting policy? (This GH action will add the "backport" label based on these file globs; if you want to prevent it from adding the "backport" label, use the no-backport-autolabel label.)

Signed-off-by: Lukasz Dziedziak <[email protected]>
@lukidzi lukidzi added the ci/run-full-matrix PR: Runs all possible e2e test combination (expensive use carefully) label Mar 4, 2025
lukidzi added 3 commits March 4, 2025 16:03
@lukidzi
Contributor Author

lukidzi commented Mar 4, 2025

Should we implement a feature flag to disable it? Please vote

@lukidzi lukidzi marked this pull request as ready for review March 4, 2025 22:10
@lukidzi lukidzi requested a review from a team as a code owner March 4, 2025 22:10
Contributor

@lobkovilya lobkovilya left a comment


The PR makes sense. I'm a bit worried that it might get messy when, on each connection, at least 6 concurrent processes try to update the same ZoneInsight resource. If at least one of them fails, the others are going to be canceled (and retried on the client).

As this could potentially cause a regression, I'd hide it behind a feature flag and disable it by default.

// In this case, all instances will try to update Zone Insight which will result in conflicts.
// Since it's unusual to immediately execute envoy admin rpcs after zone is connected, 0-10s delay should be fine.
// #nosec G404 - math rand is enough
time.Sleep(time.Duration(rand.Int31n(10000)) * time.Millisecond)
Contributor


Isn't the retry below just fine?

Contributor Author


The retry is fine, but it doesn't solve the issue for the first request; the random delay only improves the probability that the first request won't hit a conflict.

Signed-off-by: Lukasz Dziedziak <[email protected]>
@lukidzi lukidzi requested review from lahabana and lobkovilya March 7, 2025 12:32