
Conversation

elijah-rou
Contributor

Fixes: websockets (and some HTTP connections) closing abruptly when the
queue-proxy undergoes drain.

Because server.Shutdown in net/http does not wait for hijacked
connections, any active websocket connections currently end as soon as
the queue-proxy calls .Shutdown. See gorilla/websocket#448 and
golang/go#17721 for details. This patch fixes the issue by introducing
an atomic counter of active requests, which is incremented as a request
comes in and decremented as its handler terminates. During drain, this
counter must reach zero (or the revision timeout must elapse) before
.Shutdown is called.

Further, this prevents premature closing of connections in the user
container due to misconfigured SIGTERM handling, by delaying the SIGTERM
until the queue-proxy has verified it has fully drained.

@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@knative-prow knative-prow bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2025

knative-prow bot commented Sep 12, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 12, 2025

knative-prow bot commented Sep 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign skonto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot requested review from dprotaso and skonto September 12, 2025 15:20
@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from 860f81b to f03e9f6 Compare September 12, 2025 15:42
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.76%. Comparing base (26a8cec) to head (1437e6f).

Files with missing lines Patch % Lines
pkg/queue/sharedmain/main.go 0.00% 23 Missing ⚠️
pkg/queue/breaker.go 68.75% 5 Missing ⚠️
pkg/activator/net/throttler.go 71.42% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16080      +/-   ##
==========================================
+ Coverage   80.20%   80.76%   +0.55%     
==========================================
  Files         214      215       +1     
  Lines       16887    17038     +151     
==========================================
+ Hits        13544    13760     +216     
+ Misses       2987     2914      -73     
- Partials      356      364       +8     

☔ View full report in Codecov by Sentry.

@elijah-rou
Contributor Author

/retest


knative-prow bot commented Sep 12, 2025

@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from e5815b0 to af1bf73 Compare September 12, 2025 17:40
@dprotaso
Member

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2025
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
knative-automation and others added 8 commits September 30, 2025 12:46
bumping knative.dev/hack f88b7db...af735b2:
  > af735b2 Fix dot releases (#434)

Signed-off-by: Knative Automation <[email protected]>
Fixes: Websockets (and some HTTP) closing abruptly when queue-proxy
undergoes drain.

Due to hijacked connections in net/http not being respected when
server.Shutdown is called, any active websocket connections currently
end as soon as the queue-proxy calls .Shutdown. See
gorilla/websocket#448 and golang/go#17721 for details. This patch fixes
this issue by introducing an atomic counter of active requests, which
increments as a request comes in and decrements as a request handler
terminates. During drain, this counter must reach zero or adhere to the
revision timeout, in order to call .Shutdown.

Further, this prevents premature closing of connections in the user
container due to misconfigured SIGTERM handling, by delaying the SIGTERM
until the queue-proxy has verified it has fully drained.
The previous implementation had a circular dependency where:
- User container PreStop waited for drain-complete file
- Queue-proxy only wrote drain-complete after receiving SIGTERM
- But SIGTERM was blocked waiting for PreStop to finish

This fix implements a two-stage drain signal:
1. Queue-proxy PreStop writes drain-started immediately on pod deletion
2. User container PreStop waits for drain-started (with 3s timeout for
safety)
3. Queue-proxy SIGTERM handler drains requests and writes drain-complete
4. User container waits for drain-complete before allowing termination

This ensures proper shutdown sequencing without deadlock while still
delaying user container termination until queue-proxy has drained.

Also includes cleanup of stale drain signal files on queue-proxy
startup.

feat: improve PreStop drain coordination with exponential backoff

- Replace fixed 3-second wait with exponential backoff (1, 2, 4, 8
seconds)
- Change drain-complete check interval from 0.1s to 1s to reduce CPU
usage
- Exit gracefully if drain-started is never detected after retries
- More robust handling of queue-proxy failures or slow PreStop execution

This provides better resilience against timing issues while reducing
unnecessary CPU usage during the wait loop.

test: add comprehensive integration tests for shutdown coordination

Add integration tests to verify the PreStop shutdown coordination works
correctly in various scenarios:

- Normal shutdown sequence with proper signal ordering
- Queue-proxy crash/failure scenarios
- High load conditions with many pending requests
- File system permission issues
- Race condition testing with 50 iterations
- Long-running requests that exceed typical drain timeout

These tests ensure the exponential backoff and two-stage drain signal
mechanism handles edge cases gracefully.

Run with: go test -tags=integration -race ./pkg/queue/sharedmain

refactor: extract drain signal paths and logic to shared constants

Based on PR review feedback, centralize drain signal configuration:

- Create pkg/queue/drain/signals.go with all drain-related constants
- Define signal file paths (DrainStartedFile, DrainCompleteFile)
- Extract shell script logic into BuildDrainWaitScript() function
- Define exponential backoff delays and check intervals as constants
- Update all references to use the new constants package

This improves code maintainability and makes it easier to modify
drain behavior in the future. All file paths and timing parameters
are now defined in a single location.
@elijah-rou elijah-rou force-pushed the feat/graceful-queue-proxy-drain branch from af1bf73 to 7a48a83 Compare September 30, 2025 16:46
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 30, 2025

knative-prow bot commented Sep 30, 2025

There are empty aliases in OWNER_ALIASES, cleanup is advised.

@dprotaso
Member

dprotaso commented Oct 1, 2025

/retest

@dprotaso
Member

dprotaso commented Oct 1, 2025

I'm going to drop some of the extra commits in this PR - it makes the diff a bit confusing

@dprotaso
Member

dprotaso commented Oct 1, 2025

I cherry-picked commits into this PR #16104 - that drops all the extra vendor changes in this PR. If you feel I haven't dropped anything important, feel free to update this PR by force pushing over it.

One observation I have is that the upgrade tests seem to be failing due to the changes.

You can see it here: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=16080

The ProbeTest ensures we don't drop traffic when updating Knative components including the Revision Pods.

prober.go:171: "http://upgrade-probe.serving-tests.example.com" status = 502, want: 200
prober.go:172: response: status: 502, body: dial tcp 127.0.0.1:8080: connect: connection refused
..
prober.go:186: Stopping all probers
probe.go:63: CheckSLO() error SLI for "TestServingUpgrades/Run/ProbeTest" = 0.999738, wanted >= 1.000000

In CI it's pretty stable - https://testgrid.k8s.io/r/knative-own-testgrid/serving#continuous&width=90&include-filter-by-regex=ProbeTest


knative-prow bot commented Oct 1, 2025

@elijah-rou: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         | Commit  | Details | Required | Rerun command
upgrade-tests_serving_main        | 1437e6f | link    | true     | /test upgrade-tests
istio-latest-no-mesh_serving_main | 1437e6f | link    | true     | /test istio-latest-no-mesh

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@dprotaso dprotaso left a comment

This diff seems to have several unrelated changes in the activator/net/throttler & queue/breaker. If those are intentional we might want to pull that into separate PRs.

I also don't think we want to incorporate shell scripts/volumes into our draining mechanism. For that to work it would mean we require a shell in our queue proxy container, which we don't want because that increases cold start when there's an image pull. Secondly, it would require a shell in all the user containers, which is not something we can guarantee in existing end-user workloads. It's also unclear how the empty volume affects cold start for the containers that don't use websockets.

Thus I think this PR is a non-starter in terms of the approach.

One thing I'm wondering about - would it be simple to keep the existing drain mechanism we have and have the queue proxy simply send a close frame to the application?

https://pkg.go.dev/github.com/gorilla/websocket#FormatCloseMessage

Comment on lines +106 to +108
if capacity > 0x7FFFFFFF {
	return 0x7FFFFFFF // Return max int32 value
}
Member

We don't support 32bit architectures - so int == int64


 type breaker interface {
-	Capacity() int
+	Capacity() uint64
Member

Anything motivating this change? This seems unrelated to web socket drain?

I've also tried something like this before but you end up with a lot of casting that I didn't feel like it was worth it because int==int64 on the arch's we support

Contributor Author

Oh, I just changed it because of a TODO comment about changing the type; figured I would just get it done. I can revert if you feel it is not worth the trouble.

Comment on lines +731 to +736
concurrency := ib.concurrency.Load()
// Safe conversion: concurrency is int32 and we check for non-negative
if concurrency >= 0 {
	return uint64(concurrency)
}
return 0
Member

infinite breaker is only really 0 or 1 so I don't think we need this extra conversion check

Contributor Author

The linter was complaining which is why I did this.

Comment on lines +371 to +372
"/bin/sh", "-c",
drain.QueueProxyPreStopScript,
Member

This is sorta a non-starter. The queue proxy base image doesn't include any shell. This is intentional to keep the image small given it affects cold starts.

Contributor Author

It includes sh. This is what we are running now in production.

Member

It shouldn't - I'm guessing you're rebuilding with a different base image?

docker run -ti ghcr.io/wolfi-dev/static:alpine /bin/sh
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown

Run 'docker run --help' for more information

https://github.com/knative/serving/blob/1ffe339257c09178ac0f8a3b8b28badd04c8272b/.ko.yaml#L2C19-L2C50

Checking the latest release

docker run -ti --entrypoint /bin/sh gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:f155665f3a5faad07ab4dbe7e010790383936e5864344ccca7f40bbb2fa0f30b
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown

Run 'docker run --help' for more information

Comment on lines +174 to +178
userAgent := r.Header.Get("User-Agent")
return strings.HasPrefix(userAgent, netheader.ActivatorUserAgent) ||
	strings.HasPrefix(userAgent, netheader.AutoscalingUserAgent) ||
	strings.HasPrefix(userAgent, netheader.QueueProxyUserAgent) ||
	strings.HasPrefix(userAgent, netheader.IngressReadinessUserAgent)
Member

I thought the autoscaler and activator etc all use K-Network-Probe header which is covered by netheader.IsProbe(r)

Do you know what exactly is getting through here?

Contributor Author

Not sure exactly; https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/16080/upgrade-tests_serving_main/1973128182416543744 is complaining about it. In practice, I have not observed a problem.

@dprotaso
Member

dprotaso commented Oct 1, 2025

/hold

Does your websocket e2e drain test reliably fail? I'm inclined to merge that in via a separate PR but skip the test by default until someone introduces a fix.

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025
@dprotaso
Member

dprotaso commented Oct 1, 2025

I'm wondering if we need to do something like this for websocket handling in the queue proxy

https://go.dev/play/p/RwdLe7OXaPj

@elijah-rou
Contributor Author

I'm wondering if we need to do something like this for websocket handling in the queue proxy

go.dev/play/p/RwdLe7OXaPj

I'll take a look
