Ensure connections persist until done on queue-proxy drain #16080
base: main
Conversation
Hi @elijah-rou. Thanks for your PR. I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
force-pushed from 860f81b to f03e9f6
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main   #16080      +/-   ##
==========================================
+ Coverage   80.20%   80.76%    +0.55%
==========================================
  Files         214      215        +1
  Lines       16887    17038      +151
==========================================
+ Hits        13544    13760      +216
+ Misses       2987     2914       -73
- Partials      356      364        +8
/retest
@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
force-pushed from e5815b0 to af1bf73
/ok-to-test
bumping knative.dev/hack f88b7db...af735b2:
> af735b2 Fix dot releases (#434)

Signed-off-by: Knative Automation <[email protected]>
Fixes: Websockets (and some HTTP requests) closing abruptly when the queue-proxy undergoes drain.

Because hijacked connections are not respected by net/http when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which increments as a request comes in and decrements as a request handler terminates. During drain, this counter must reach zero, or the revision timeout must elapse, before .Shutdown is called.

Further, this prevents premature closing of connections in the user container due to misconfigured SIGTERM handling, by delaying the SIGTERM send until the queue-proxy has verified it has fully drained.
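As an illustration of that description, here is a minimal, self-contained Go sketch of the counting-and-drain idea; the names (requestCounter, Wrap, Drain) and timings are hypothetical, not the actual queue-proxy code:

```go
package main

import (
	"context"
	"net/http"
	"sync/atomic"
	"time"
)

// requestCounter tracks in-flight requests, including hijacked (websocket)
// connections that server.Shutdown alone would not wait for.
type requestCounter struct {
	active atomic.Int64
}

// Wrap increments the counter when a request arrives and decrements it when
// the handler returns, so a hijacked connection stays counted until its
// handler goroutine exits.
func (rc *requestCounter) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rc.active.Add(1)
		defer rc.active.Add(-1)
		next.ServeHTTP(w, r)
	})
}

// Drain waits until the counter reaches zero or the revision timeout elapses,
// and only then calls Shutdown.
func (rc *requestCounter) Drain(srv *http.Server, revisionTimeout time.Duration) error {
	deadline := time.Now().Add(revisionTimeout)
	for rc.active.Load() > 0 && time.Now().Before(deadline) {
		time.Sleep(100 * time.Millisecond)
	}
	return srv.Shutdown(context.Background())
}

func main() {
	rc := &requestCounter{}
	srv := &http.Server{Addr: ":8080", Handler: rc.Wrap(http.DefaultServeMux)}
	go func() { _ = srv.ListenAndServe() }()

	// On SIGTERM the drain would be triggered, e.g.:
	_ = rc.Drain(srv, 5*time.Minute)
}
```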
The previous implementation had a circular dependency:

- The user container PreStop hook waited for the drain-complete file
- The queue-proxy only wrote drain-complete after receiving SIGTERM
- But SIGTERM was blocked waiting for PreStop to finish

This fix implements a two-stage drain signal:

1. The queue-proxy PreStop hook writes drain-started immediately on pod deletion
2. The user container PreStop hook waits for drain-started (with a 3s timeout for safety)
3. The queue-proxy SIGTERM handler drains requests and writes drain-complete
4. The user container waits for drain-complete before allowing termination

This ensures proper shutdown sequencing without deadlock while still delaying user-container termination until the queue-proxy has drained. Also includes cleanup of stale drain signal files on queue-proxy startup.

feat: improve PreStop drain coordination with exponential backoff

- Replace the fixed 3-second wait with exponential backoff (1, 2, 4, 8 seconds)
- Change the drain-complete check interval from 0.1s to 1s to reduce CPU usage
- Exit gracefully if drain-started is never detected after retries
- More robust handling of queue-proxy failures or slow PreStop execution

This provides better resilience against timing issues while reducing unnecessary CPU usage during the wait loop.

test: add comprehensive integration tests for shutdown coordination

Add integration tests to verify that the PreStop shutdown coordination works correctly in various scenarios:

- Normal shutdown sequence with proper signal ordering
- Queue-proxy crash/failure scenarios
- High load conditions with many pending requests
- File system permission issues
- Race condition testing with 50 iterations
- Long-running requests that exceed the typical drain timeout

These tests ensure the exponential backoff and two-stage drain signal mechanism handle edge cases gracefully. Run with: go test -tags=integration -race ./pkg/queue/sharedmain

refactor: extract drain signal paths and logic to shared constants

Based on PR review feedback, centralize drain signal configuration:

- Create pkg/queue/drain/signals.go with all drain-related constants
- Define signal file paths (DrainStartedFile, DrainCompleteFile)
- Extract the shell-script logic into a BuildDrainWaitScript() function
- Define exponential backoff delays and check intervals as constants
- Update all references to use the new constants package

This improves code maintainability and makes it easier to modify drain behavior in the future. All file paths and timing parameters are now defined in a single location, as sketched below.
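For reference, a hedged sketch of what a pkg/queue/drain/signals.go along those lines might look like; the file paths, mount location, and script details are assumptions drawn from the commit message, not the actual implementation:

```go
package drain

import "fmt"

const (
	// Signal files shared between the queue-proxy and the user container via
	// an emptyDir volume; the mount path here is an assumption.
	DrainStartedFile  = "/var/run/knative-drain/drain-started"
	DrainCompleteFile = "/var/run/knative-drain/drain-complete"
)

// BuildDrainWaitScript returns the shell snippet run by the user container's
// PreStop hook: wait for drain-started with exponential backoff (1, 2, 4, 8s),
// then poll once per second until drain-complete appears.
func BuildDrainWaitScript() string {
	return fmt.Sprintf(`
started=""
for d in 1 2 4 8; do
  if [ -f %[1]s ]; then started="yes"; break; fi
  sleep "$d"
done
# Exit gracefully if drain-started never appeared (e.g. queue-proxy crashed).
[ -z "$started" ] && exit 0
until [ -f %[2]s ]; do
  sleep 1
done`, DrainStartedFile, DrainCompleteFile)
}
```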
force-pushed from af1bf73 to 7a48a83
There are empty aliases in OWNER_ALIASES; cleanup is advised.
/retest

I'm going to drop some of the extra commits in this PR - it makes the diff a bit confusing.
I cherry-picked commits into this PR #16104 - that drops all the extra vendor changes in this PR. If you feel I haven't dropped anything important, feel free to update this PR by force-pushing over it.

One observation I have is that the upgrade tests seem to be failing due to the changes. You can see it here: https://prow.knative.dev/pr-history/?org=knative&repo=serving&pr=16080 The ProbeTest ensures we don't drop traffic when updating Knative components, including the Revision Pods.

It is pretty stable in CI - https://testgrid.k8s.io/r/knative-own-testgrid/serving#continuous&width=90&include-filter-by-regex=ProbeTest
@elijah-rou: The following tests failed, say /retest to rerun all failed tests:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This diff seems to have several unrelated changes in the activator/net/throttler & queue/breaker. If those are intentional, we might want to pull them into separate PRs.

I also don't think we want to incorporate shell scripts/volumes into our draining mechanism. For that to work we would have to require a shell in our queue-proxy container, which we don't want because it increases cold start when there's an image pull. Secondly, it would require a shell in all the user containers, which is not something we can guarantee for existing end-user workloads. It's also unclear how the empty volume affects cold start for containers that don't use websockets.

Thus I think this PR is a non-starter in terms of the approach.
One thing I'm wondering about - would it be simple to keep the existing drain mechanism we have and have the queue proxy simply send a close frame to the application?
https://pkg.go.dev/github.com/gorilla/websocket#FormatCloseMessage
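For reference, sending a close frame with gorilla/websocket looks roughly like the sketch below. This is only an illustration of the suggestion, assuming access to an already-upgraded *websocket.Conn; it is not something the queue proxy does today:

```go
package wsclose

import (
	"time"

	"github.com/gorilla/websocket"
)

// sendCloseFrame asks the peer to terminate the connection cleanly by sending
// a close control frame with the "going away" status code.
func sendCloseFrame(conn *websocket.Conn) error {
	msg := websocket.FormatCloseMessage(websocket.CloseGoingAway, "server draining")
	return conn.WriteControl(websocket.CloseMessage, msg, time.Now().Add(time.Second))
}
```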
if capacity > 0x7FFFFFFF {
	return 0x7FFFFFFF // Return max int32 value
}
We don't support 32bit architectures - so int == int64
type breaker interface {
-	Capacity() int
+	Capacity() uint64
Anything motivating this change? This seems unrelated to the websocket drain.

I've also tried something like this before, but you end up with a lot of casting that I didn't feel was worth it, because int == int64 on the architectures we support.
Oh, I just changed it because of a TODO comment about changing the type - figured I would just get it done. I can revert it if you feel it's not worth the trouble.
concurrency := ib.concurrency.Load()
// Safe conversion: concurrency is int32 and we check for non-negative
if concurrency >= 0 {
	return uint64(concurrency)
}
return 0
The infinite breaker is only really 0 or 1, so I don't think we need this extra conversion check.
The linter was complaining, which is why I did this.

"/bin/sh", "-c",
drain.QueueProxyPreStopScript,
This is sorta a non-starter. The queue proxy base image doesn't include any shell. This is intentional to keep the image small given it affects cold starts.
It includes sh. This is what we are running now in production.
It shouldn't - I'm guessing you're rebuilding with a different base image?
docker run -ti ghcr.io/wolfi-dev/static:alpine /bin/sh
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
Run 'docker run --help' for more information
Checking the latest release
docker run -ti --entrypoint /bin/sh gcr.io/knative-releases/knative.dev/serving/cmd/queue@sha256:f155665f3a5faad07ab4dbe7e010790383936e5864344ccca7f40bbb2fa0f30b
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
Run 'docker run --help' for more information
userAgent := r.Header.Get("User-Agent")
return strings.HasPrefix(userAgent, netheader.ActivatorUserAgent) ||
	strings.HasPrefix(userAgent, netheader.AutoscalingUserAgent) ||
	strings.HasPrefix(userAgent, netheader.QueueProxyUserAgent) ||
	strings.HasPrefix(userAgent, netheader.IngressReadinessUserAgent)
I thought the autoscaler, activator, etc. all use the K-Network-Probe header, which is covered by netheader.IsProbe(r).

Do you know what exactly is getting through here?
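To make the distinction concrete, here is a rough sketch of how the two checks relate; the import path and the combined helper are assumptions based on this thread, not the PR's actual code:

```go
package probecheck

import (
	"net/http"
	"strings"

	netheader "knative.dev/networking/pkg/http/header"
)

// isSystemRequest combines the two checks discussed above: the K-Network-Probe
// header used by activator/autoscaler probes (netheader.IsProbe), plus the
// User-Agent prefixes matched in the diff.
func isSystemRequest(r *http.Request) bool {
	if netheader.IsProbe(r) {
		return true
	}
	ua := r.Header.Get("User-Agent")
	return strings.HasPrefix(ua, netheader.ActivatorUserAgent) ||
		strings.HasPrefix(ua, netheader.AutoscalingUserAgent) ||
		strings.HasPrefix(ua, netheader.QueueProxyUserAgent)
}
```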
Not sure exactly - https://prow.knative.dev/view/gs/knative-prow/pr-logs/pull/knative_serving/16080/upgrade-tests_serving_main/1973128182416543744 is complaining about it. In practice, I have not observed a problem.
/hold

Does your websocket e2e drain test reliably fail? I'm inclined to merge that in via a separate PR, but skip the test by default until someone introduces a fix.
I'm wondering if we need to do something like this for websocket handling in the queue proxy
I'll take a look