[wip] feat/graceful queue proxy drain 2 #16104
Conversation
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dprotaso
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from 5655fef to 3765daa (Compare)
Codecov Report: ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #16104 +/- ##
==========================================
+ Coverage 80.20% 80.74% +0.53%
==========================================
Files 214 215 +1
Lines 16887 17038 +151
==========================================
+ Hits 13544 13757 +213
+ Misses 2987 2912 -75
- Partials 356 369 +13
☔ View full report in Codecov by Sentry.
Fixes: websockets (and some HTTP requests) closing abruptly when the queue-proxy undergoes drain. Because hijacked connections in net/http are not respected when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which is incremented as a request comes in and decremented as the request handler terminates. During drain, the queue-proxy waits for this counter to reach zero, or for the revision timeout to elapse, before calling .Shutdown. This also prevents premature closing of connections in the user container due to misconfigured SIGTERM handling, by delaying the SIGTERM send until the queue-proxy has verified it has fully drained.
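A minimal sketch of the in-flight counting idea described above, not the PR's actual implementation; the `drainsketch` package and all identifiers below are illustrative assumptions:

```go
// Package drainsketch illustrates counting in-flight requests (including
// hijacked websocket connections, which http.Server.Shutdown does not wait
// for) so Shutdown is only called once handlers have finished or a timeout
// elapses. Names and intervals here are assumptions, not queue-proxy code.
package drainsketch

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// Drainer wraps a handler and tracks active requests.
type Drainer struct {
	Inner  http.Handler
	active atomic.Int64
}

func (d *Drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.active.Add(1)
	// For hijacked websockets the handler typically runs until the
	// connection closes, so the counter stays non-zero while it is active.
	defer d.active.Add(-1)
	d.Inner.ServeHTTP(w, r)
}

// Drain waits until no requests are in flight or the timeout elapses,
// then shuts the server down.
func (d *Drainer) Drain(srv *http.Server, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for d.active.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return srv.Shutdown(ctx)
}
```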
The previous implementation had a circular dependency:
- User container PreStop waited for the drain-complete file
- Queue-proxy only wrote drain-complete after receiving SIGTERM
- But SIGTERM was blocked waiting for PreStop to finish

This fix implements a two-stage drain signal:
1. Queue-proxy PreStop writes drain-started immediately on pod deletion
2. User container PreStop waits for drain-started (with a 3s timeout for safety)
3. Queue-proxy SIGTERM handler drains requests and writes drain-complete
4. User container waits for drain-complete before allowing termination

This ensures proper shutdown sequencing without deadlock while still delaying user container termination until the queue-proxy has drained. Also includes cleanup of stale drain signal files on queue-proxy startup.

feat: improve PreStop drain coordination with exponential backoff
- Replace the fixed 3-second wait with exponential backoff (1, 2, 4, 8 seconds)
- Change the drain-complete check interval from 0.1s to 1s to reduce CPU usage
- Exit gracefully if drain-started is never detected after the retries
- More robust handling of queue-proxy failures or slow PreStop execution

This provides better resilience against timing issues while reducing unnecessary CPU usage during the wait loop.

test: add comprehensive integration tests for shutdown coordination
Add integration tests to verify the PreStop shutdown coordination works correctly in various scenarios:
- Normal shutdown sequence with proper signal ordering
- Queue-proxy crash/failure scenarios
- High-load conditions with many pending requests
- File system permission issues
- Race condition testing with 50 iterations
- Long-running requests that exceed the typical drain timeout

These tests ensure the exponential backoff and two-stage drain signal mechanism handle edge cases gracefully. Run with: go test -tags=integration -race ./pkg/queue/sharedmain

refactor: extract drain signal paths and logic to shared constants
Based on PR review feedback, centralize the drain signal configuration:
- Create pkg/queue/drain/signals.go with all drain-related constants
- Define the signal file paths (DrainStartedFile, DrainCompleteFile)
- Extract the shell script logic into a BuildDrainWaitScript() function
- Define the exponential backoff delays and check intervals as constants
- Update all references to use the new constants package

This improves maintainability and makes it easier to modify drain behavior in the future; all file paths and timing parameters are now defined in a single location.
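A rough sketch of what such a shared constants package and PreStop wait script could look like. The names DrainStartedFile, DrainCompleteFile, and BuildDrainWaitScript come from the description above, but the file paths, delays, and script body here are assumptions, not the PR's actual code:

```go
// Package drain centralizes drain signal paths and the user-container
// PreStop wait script, as described in the refactor commit.
// All concrete values below are illustrative assumptions.
package drain

import (
	"fmt"
	"time"
)

const (
	// Signal files written by the queue-proxy (paths are assumed).
	DrainStartedFile  = "/var/run/knative/drain-started"
	DrainCompleteFile = "/var/run/knative/drain-complete"

	// How often the user container polls for drain-complete.
	CheckInterval = 1 * time.Second
)

// BackoffDelays are the waits between retries while looking for
// drain-started (1, 2, 4, 8 seconds, per the description).
var BackoffDelays = []time.Duration{
	1 * time.Second, 2 * time.Second, 4 * time.Second, 8 * time.Second,
}

// BuildDrainWaitScript returns a shell script for the user container's
// PreStop hook: wait for drain-started with exponential backoff, then poll
// for drain-complete once per second, and exit gracefully if the
// queue-proxy never writes the first signal.
func BuildDrainWaitScript() string {
	return fmt.Sprintf(`
for delay in 1 2 4 8; do
  if [ -f %[1]s ]; then
    while [ ! -f %[2]s ]; do sleep 1; done
    exit 0
  fi
  sleep "$delay"
done
# drain-started never appeared; do not block pod termination forever.
exit 0
`, DrainStartedFile, DrainCompleteFile)
}
```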
Force-pushed from c9ecc3d to e245733 (Compare)
@dprotaso: The following test failed.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Closing, see: #16080 (review)
Testing - #16080