
Conversation

elijah-rou
Contributor

Fixes: websockets (and some HTTP connections) closing abruptly when the
queue-proxy undergoes drain.

Because server.Shutdown in net/http does not wait for hijacked
connections, any active websocket connections currently end as soon as
the queue-proxy calls .Shutdown. See gorilla/websocket#448 and
golang/go#17721 for details. This patch fixes the issue by introducing
an atomic counter of active requests, which is incremented as a request
comes in and decremented as its handler terminates. During drain, this
counter must reach zero (or the revision timeout must elapse) before
.Shutdown is called.

Further, this prevents premature closing of connections in the user
container due to misconfigured SIGTERM handling, by delaying the SIGTERM
until the queue-proxy has verified it has fully drained.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@knative-prow knative-prow bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2025

knative-prow bot commented Sep 12, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 12, 2025

knative-prow bot commented Sep 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign skonto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot requested review from dprotaso and skonto September 12, 2025 15:20
@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from 860f81b to f03e9f6 Compare September 12, 2025 15:42
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.76%. Comparing base (26a8cec) to head (1437e6f).

Files with missing lines Patch % Lines
pkg/queue/sharedmain/main.go 0.00% 23 Missing ⚠️
pkg/queue/breaker.go 68.75% 5 Missing ⚠️
pkg/activator/net/throttler.go 71.42% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16080      +/-   ##
==========================================
+ Coverage   80.20%   80.76%   +0.55%     
==========================================
  Files         214      215       +1     
  Lines       16887    17038     +151     
==========================================
+ Hits        13544    13760     +216     
+ Misses       2987     2914      -73     
- Partials      356      364       +8     

☔ View full report in Codecov by Sentry.

@elijah-rou
Contributor Author

/retest


knative-prow bot commented Sep 12, 2025

@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from e5815b0 to af1bf73 Compare September 12, 2025 17:40
@dprotaso
Member

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2025
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
knative-automation and others added 8 commits September 30, 2025 12:46
bumping knative.dev/hack f88b7db...af735b2:
  > af735b2 Fix dot releases (#434)

Signed-off-by: Knative Automation <[email protected]>
Fixes: Websockets (and some HTTP) closing abruptly when queue-proxy
undergoes drain.

Due to hijacked connections in net/http not being respected when
server.Shutdown is called, any active websocket connections currently
end as soon as the queue-proxy calls .Shutdown. See
gorilla/websocket#448 and golang/go#17721 for details. This patch fixes
this issue by introducing an atomic counter of active requests, which
increments as a request comes in and decrements as a request handler
terminates. During drain, this counter must reach zero or adhere to the
revision timeout, in order to call .Shutdown.

Further, this prevents premature closing of connections in the user
container due to misconfigured SIGTERM handling, by delaying the SIGTERM
until the queue-proxy has verified it has fully drained.
The previous implementation had a circular dependency where:
- User container PreStop waited for drain-complete file
- Queue-proxy only wrote drain-complete after receiving SIGTERM
- But SIGTERM was blocked waiting for PreStop to finish

This fix implements a two-stage drain signal:
1. Queue-proxy PreStop writes drain-started immediately on pod deletion
2. User container PreStop waits for drain-started (with 3s timeout for
safety)
3. Queue-proxy SIGTERM handler drains requests and writes drain-complete
4. User container waits for drain-complete before allowing termination

This ensures proper shutdown sequencing without deadlock while still
delaying user container termination until queue-proxy has drained.

Also includes cleanup of stale drain signal files on queue-proxy
startup.

feat: improve PreStop drain coordination with exponential backoff

- Replace fixed 3-second wait with exponential backoff (1, 2, 4, 8
seconds)
- Change drain-complete check interval from 0.1s to 1s to reduce CPU
usage
- Exit gracefully if drain-started is never detected after retries
- More robust handling of queue-proxy failures or slow PreStop execution

This provides better resilience against timing issues while reducing
unnecessary CPU usage during the wait loop.

test: add comprehensive integration tests for shutdown coordination

Add integration tests to verify the PreStop shutdown coordination works
correctly in various scenarios:

- Normal shutdown sequence with proper signal ordering
- Queue-proxy crash/failure scenarios
- High load conditions with many pending requests
- File system permission issues
- Race condition testing with 50 iterations
- Long-running requests that exceed typical drain timeout

These tests ensure the exponential backoff and two-stage drain signal
mechanism handles edge cases gracefully.

Run with: go test -tags=integration -race ./pkg/queue/sharedmain

refactor: extract drain signal paths and logic to shared constants

Based on PR review feedback, centralize drain signal configuration:

- Create pkg/queue/drain/signals.go with all drain-related constants
- Define signal file paths (DrainStartedFile, DrainCompleteFile)
- Extract shell script logic into BuildDrainWaitScript() function
- Define exponential backoff delays and check intervals as constants
- Update all references to use the new constants package

This improves code maintainability and makes it easier to modify
drain behavior in the future. All file paths and timing parameters
are now defined in a single location.
@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from af1bf73 to 7a48a83 Compare September 30, 2025 16:46
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2025

knative-prow bot commented Sep 30, 2025

There are empty aliases in OWNER_ALIASES, cleanup is advised.

@dprotaso
Member

dprotaso commented Oct 1, 2025

/retest

@dprotaso
Member

dprotaso commented Oct 1, 2025

I'm going to drop some of the extra commits in this PR - it makes the diff a bit confusing

@dprotaso
Member

dprotaso commented Oct 1, 2025

I cherry-picked commits into this PR #16104 - that drops all the extra vendor changes in this PR. If you feel I haven't dropped anything important, feel free to update this PR by force pushing over it.

One observation I have is that the upgrade tests seem to be failing due to the changes.

You can see it here: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=16080

The ProbeTest ensures we don't drop traffic when updating Knative components including the Revision Pods.

prober.go:171: "http://upgrade-probe.serving-tests.example.com" status = 502, want: 200
prober.go:172: response: status: 502, body: dial tcp 127.0.0.1:8080: connect: connection refused
..
prober.go:186: Stopping all probers
probe.go:63: CheckSLO() error SLI for "TestServingUpgrades/Run/ProbeTest" = 0.999738, wanted >= 1.000000

In CI it's pretty stable - https://testgrid.k8s.io/r/knative-own-testgrid/serving#continuous&width=90&include-filter-by-regex=ProbeTest


knative-prow bot commented Oct 1, 2025

@elijah-rou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         | Commit  | Details | Required | Rerun command
upgrade-tests_serving_main        | 1437e6f | link    | true     | /test upgrade-tests
istio-latest-no-mesh_serving_main | 1437e6f | link    | true     | /test istio-latest-no-mesh

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@dprotaso dprotaso left a comment

This diff seems to have several unrelated changes in the activator/net/throttler & queue/breaker. If those are intentional we might want to pull that into separate PRs.

I also don't think we want to incorporate shell scripts/volumes into our draining mechanism. For that to work it would mean we require a shell in our queue proxy container, which we don't want because that increases cold start when there's an image pull. Secondly, it would require a shell in all the user containers, which is not something we can guarantee in existing end-user workloads. It's also unclear how the empty volume affects cold start for the containers that don't use websockets.

Thus I think this PR is a non-starter in terms of the approach.

One thing I'm wondering about - would it be simple to keep the existing drain mechanism we have and have the queue proxy simply send a close frame to the application?

https://pkg.go.dev/github.com/gorilla/websocket#FormatCloseMessage

Comment on lines +106 to +108
if capacity > 0x7FFFFFFF {
	return 0x7FFFFFFF // Return max int32 value
}
Member

We don't support 32bit architectures - so int == int64


 type breaker interface {
-	Capacity() int
+	Capacity() uint64
Member

Anything motivating this change? This seems unrelated to web socket drain?

I've also tried something like this before but you end up with a lot of casting that I didn't feel like it was worth it because int==int64 on the arch's we support

Contributor Author

Oh, I just changed it because of a TODO comment about changing the type; figured I would just get it done. I can revert if you feel it is not worth the trouble.

Comment on lines +731 to +736
concurrency := ib.concurrency.Load()
// Safe conversion: concurrency is int32 and we check for non-negative
if concurrency >= 0 {
	return uint64(concurrency)
}
return 0
Member

infinite breaker is only really 0 or 1 so I don't think we need this extra conversion check

Contributor Author

The linter was complaining which is why I did this.

Comment on lines +371 to +372
"/bin/sh", "-c",
drain.QueueProxyPreStopScript,
Member

This is sorta a non-starter. The queue proxy base image doesn't include any shell. This is intentional to keep the image small given it affects cold starts.

Contributor Author

It includes sh. This is what we are running now in production.

Member

It shouldn't - I'm guessing you're rebuilding with a different base image?

docker run -ti ghcr.io/wolfi-dev/static:alpine /bin/sh
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown

Run 'docker run --help' for more information

https://github.com/knative/serving/blob/1ffe339257c09178ac0f8a3b8b28badd04c8272b/.ko.yaml#L2C19-L2C50

Checking the latest release

docker run -ti --entrypoint /bin/sh gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:f155665f3a5faad07ab4dbe7e010790383936e5864344ccca7f40bbb2fa0f30b
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown

Run 'docker run --help' for more information

Comment on lines +174 to +178
userAgent := r.Header.Get("User-Agent")
return strings.HasPrefix(userAgent, netheader.ActivatorUserAgent) ||
	strings.HasPrefix(userAgent, netheader.AutoscalingUserAgent) ||
	strings.HasPrefix(userAgent, netheader.QueueProxyUserAgent) ||
	strings.HasPrefix(userAgent, netheader.IngressReadinessUserAgent)
Member

I thought the autoscaler and activator etc all use K-Network-Probe header which is covered by netheader.IsProbe(r)

Do you know what exactly is getting through here?

Contributor Author

Not sure exactly; https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/16080/upgrade-tests_serving_main/1973128182416543744 is complaining about it. In practice, I have not observed a problem.

@dprotaso
Member

dprotaso commented Oct 1, 2025

/hold

Does your websocket e2e drain test reliably fail? I'm inclined to merge that in via a separate PR but skip the test by default until someone introduces a fix.

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025
@dprotaso
Member

dprotaso commented Oct 1, 2025

I'm wondering if we need to do something like this for websocket handling in the queue proxy

https://go.dev/play/p/RwdLe7OXaPj

@elijah-rou
Contributor Author

I'm wondering if we need to do something like this for websocket handling in the queue proxy

go.dev/play/p/RwdLe7OXaPj

I'll take a look
