Skip to content

Conversation

dprotaso
Copy link
Member

@dprotaso dprotaso commented Oct 1, 2025

Testing - #16080

Copy link

knative-prow bot commented Oct 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2025
@knative-prow knative-prow bot requested review from dsimansk, gauron99 and skonto October 1, 2025 01:11
@knative-prow knative-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 1, 2025
@dprotaso dprotaso force-pushed the feat/graceful-queue-proxy-drain-2 branch from 5655fef to 3765daa Compare October 1, 2025 01:12
@dprotaso dprotaso changed the title feat/graceful queue proxy drain 2 [wip] feat/graceful queue proxy drain 2 Oct 1, 2025
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 1, 2025
Copy link

codecov bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 81.28655% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.74%. Comparing base (26a8cec) to head (e245733).

Files with missing lines Patch % Lines
pkg/queue/sharedmain/main.go 0.00% 23 Missing ⚠️
pkg/queue/breaker.go 68.75% 5 Missing ⚠️
pkg/activator/net/throttler.go 71.42% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16104      +/-   ##
==========================================
+ Coverage   80.20%   80.74%   +0.53%     
==========================================
  Files         214      215       +1     
  Lines       16887    17038     +151     
==========================================
+ Hits        13544    13757     +213     
+ Misses       2987     2912      -75     
- Partials      356      369      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fixes: Websockets (and some HTTP) closing abruptly when queue-proxy
undergoes drain.

Due to hijacked connections in net/http not being respected when
server.Shutdown is called, any active websocket connections currently
end as soon as the queue-proxy calls .Shutdown. See
gorilla/websocket#448 and golang/go#17721 for details. This patch fixes
this issue by introducing an atomic counter of active requests, which
increments as a request comes in and decrements as a request handler
terminates. During drain, this counter must reach zero or adhere to the
revision timeout, in order to call .Shutdown.

Further, this prevents pre-mature closing of connections in the user
container due to misconfigured SIGTERM handling, by delaying the SIGTERM
send until the queue-proxy has verified it has fully drained.
The previous implementation had a circular dependency where:
- User container PreStop waited for drain-complete file
- Queue-proxy only wrote drain-complete after receiving SIGTERM
- But SIGTERM was blocked waiting for PreStop to finish

This fix implements a two-stage drain signal:
1. Queue-proxy PreStop writes drain-started immediately on pod deletion
2. User container PreStop waits for drain-started (with 3s timeout for
safety)
3. Queue-proxy SIGTERM handler drains requests and writes drain-complete
4. User container waits for drain-complete before allowing termination

This ensures proper shutdown sequencing without deadlock while still
delaying user container termination until queue-proxy has drained.

Also includes cleanup of stale drain signal files on queue-proxy
startup.

feat: improve PreStop drain coordination with exponential backoff

- Replace fixed 3-second wait with exponential backoff (1, 2, 4, 8
seconds)
- Change drain-complete check interval from 0.1s to 1s to reduce CPU
usage
- Exit gracefully if drain-started is never detected after retries
- More robust handling of queue-proxy failures or slow PreStop execution

This provides better resilience against timing issues while reducing
unnecessary CPU usage during the wait loop.

test: add comprehensive integration tests for shutdown coordination

Add integration tests to verify the PreStop shutdown coordination works
correctly in various scenarios:

- Normal shutdown sequence with proper signal ordering
- Queue-proxy crash/failure scenarios
- High load conditions with many pending requests
- File system permission issues
- Race condition testing with 50 iterations
- Long-running requests that exceed typical drain timeout

These tests ensure the exponential backoff and two-stage drain signal
mechanism handles edge cases gracefully.

Run with: go test -tags=integration -race ./pkg/queue/sharedmain

refactor: extract drain signal paths and logic to shared constants

Based on PR review feedback, centralize drain signal configuration:

- Create pkg/queue/drain/signals.go with all drain-related constants
- Define signal file paths (DrainStartedFile, DrainCompleteFile)
- Extract shell script logic into BuildDrainWaitScript() function
- Define exponential backoff delays and check intervals as constants
- Update all references to use the new constants package

This improves code maintainability and makes it easier to modify
drain behavior in the future. All file paths and timing parameters
are now defined in a single location.
@dprotaso dprotaso force-pushed the feat/graceful-queue-proxy-drain-2 branch from c9ecc3d to e245733 Compare October 1, 2025 01:54
Copy link

knative-prow bot commented Oct 1, 2025

@dprotaso: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
upgrade-tests_serving_main e245733 link true /test upgrade-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dprotaso
Copy link
Member Author

dprotaso commented Oct 1, 2025

Closing see: #16080 (review)

@dprotaso dprotaso closed this Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants