[wip] feat/graceful queue proxy drain 2 #16104

dprotaso · 2025-10-01T01:11:20Z

Testing - #16080

knative-prow · 2025-10-01T01:11:23Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dprotaso]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2025-10-01T01:20:04Z

Codecov Report

❌ Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.74%. Comparing base (26a8cec) to head (e245733).

Files with missing lines	Patch %	Lines
pkg/queue/sharedmain/main.go	0.00%	23 Missing ⚠️
pkg/queue/breaker.go	68.75%	5 Missing ⚠️
pkg/activator/net/throttler.go	71.42%	3 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #16104      +/-   ##
==========================================
+ Coverage   80.20%   80.74%   +0.53%     
==========================================
  Files         214      215       +1     
  Lines       16887    17038     +151     
==========================================
+ Hits        13544    13757     +213     
+ Misses       2987     2912      -75     
- Partials      356      369      +13

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fixes: Websockets (and some HTTP) closing abruptly when queue-proxy undergoes drain. Due to hijacked connections in net/http not being respected when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes this issue by introducing an atomic counter of active requests, which increments as a request comes in and decrements as a request handler terminates. During drain, this counter must reach zero or adhere to the revision timeout, in order to call .Shutdown. Further, this prevents pre-mature closing of connections in the user container due to misconfigured SIGTERM handling, by delaying the SIGTERM send until the queue-proxy has verified it has fully drained.

The previous implementation had a circular dependency where: - User container PreStop waited for drain-complete file - Queue-proxy only wrote drain-complete after receiving SIGTERM - But SIGTERM was blocked waiting for PreStop to finish This fix implements a two-stage drain signal: 1. Queue-proxy PreStop writes drain-started immediately on pod deletion 2. User container PreStop waits for drain-started (with 3s timeout for safety) 3. Queue-proxy SIGTERM handler drains requests and writes drain-complete 4. User container waits for drain-complete before allowing termination This ensures proper shutdown sequencing without deadlock while still delaying user container termination until queue-proxy has drained. Also includes cleanup of stale drain signal files on queue-proxy startup. feat: improve PreStop drain coordination with exponential backoff - Replace fixed 3-second wait with exponential backoff (1, 2, 4, 8 seconds) - Change drain-complete check interval from 0.1s to 1s to reduce CPU usage - Exit gracefully if drain-started is never detected after retries - More robust handling of queue-proxy failures or slow PreStop execution This provides better resilience against timing issues while reducing unnecessary CPU usage during the wait loop. test: add comprehensive integration tests for shutdown coordination Add integration tests to verify the PreStop shutdown coordination works correctly in various scenarios: - Normal shutdown sequence with proper signal ordering - Queue-proxy crash/failure scenarios - High load conditions with many pending requests - File system permission issues - Race condition testing with 50 iterations - Long-running requests that exceed typical drain timeout These tests ensure the exponential backoff and two-stage drain signal mechanism handles edge cases gracefully. Run with: go test -tags=integration -race ./pkg/queue/sharedmain refactor: extract drain signal paths and logic to shared constants Based on PR review feedback, centralize drain signal configuration: - Create pkg/queue/drain/signals.go with all drain-related constants - Define signal file paths (DrainStartedFile, DrainCompleteFile) - Extract shell script logic into BuildDrainWaitScript() function - Define exponential backoff delays and check intervals as constants - Update all references to use the new constants package This improves code maintainability and makes it easier to modify drain behavior in the future. All file paths and timing parameters are now defined in a single location.

knative-prow · 2025-10-01T02:16:03Z

@dprotaso: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
upgrade-tests_serving_main	`e245733`	link	true	`/test upgrade-tests`

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

dprotaso · 2025-10-01T02:40:49Z

Closing see: #16080 (review)

knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2025

knative-prow bot requested review from dsimansk, gauron99 and skonto October 1, 2025 01:11

knative-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2025

dprotaso force-pushed the feat/graceful-queue-proxy-drain-2 branch from 5655fef to 3765daa Compare October 1, 2025 01:12

dprotaso changed the title ~~feat/graceful queue proxy drain 2~~ [wip] feat/graceful queue proxy drain 2 Oct 1, 2025

knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2025

dprotaso mentioned this pull request Oct 1, 2025

Ensure conections persist until done on queue-proxy drain #16080

Open

elijah-rou added 6 commits September 30, 2025 21:51

test: improve testing suite for new drain logic

a418d66

fix: move compose handler after drainer definition

0406e33

fix: skip knative internal requests for the pending request counter

61d3647

fix: skip external k8/kn probes as well

e245733

dprotaso force-pushed the feat/graceful-queue-proxy-drain-2 branch from c9ecc3d to e245733 Compare October 1, 2025 01:54

dprotaso closed this Oct 1, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[wip] feat/graceful queue proxy drain 2 #16104

[wip] feat/graceful queue proxy drain 2 #16104

Uh oh!

dprotaso commented Oct 1, 2025 •

edited

Loading

Uh oh!

knative-prow bot commented Oct 1, 2025

Uh oh!

codecov bot commented Oct 1, 2025 •

edited

Loading

Uh oh!

knative-prow bot commented Oct 1, 2025

Uh oh!

dprotaso commented Oct 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[wip] feat/graceful queue proxy drain 2 #16104

[wip] feat/graceful queue proxy drain 2 #16104

Uh oh!

Conversation

dprotaso commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

knative-prow bot commented Oct 1, 2025

Uh oh!

codecov bot commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

knative-prow bot commented Oct 1, 2025

Uh oh!

dprotaso commented Oct 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

dprotaso commented Oct 1, 2025 •

edited

Loading

codecov bot commented Oct 1, 2025 •

edited

Loading