[Bugfix] Fix request cancellation without polling #11190
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: …
Turns out the old async llm engine needed to check for cancellation errors for this new method to work, so I shuffled around the base … Working now on a test for the openai api server; that doesn't necessarily need to block merging if we're on a tight schedule, but it should be in soon 🤞 @mgoin @simon-mo Can I get a …
Okay, the test for request cancellation with the openai server is in as well. It overloads the server with a ton of requests and then cancels them, ensuring the server can still respond afterwards. I would have rather been able to do something like check the …
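For illustration, here is the rough shape of such a test as a hedged sketch: the endpoint, port, model name, and request counts are placeholders, not the values used in the actual test, and whether cancelling the client task aborts the request server-side depends on the HTTP client closing the connection on cancellation.

```python
import asyncio

import httpx


async def main() -> None:
    # Placeholder payload; long generations pile up engine load without
    # needing a huge number of concurrent requests.
    payload = {"model": "my-model", "prompt": "Hello", "max_tokens": 512}

    async with httpx.AsyncClient(base_url="http://localhost:8000",
                                 timeout=30.0) as client:
        # Flood the server, then cancel every in-flight request.
        tasks = [
            asyncio.create_task(client.post("/v1/completions", json=payload))
            for _ in range(50)
        ]
        await asyncio.sleep(1)  # let the requests reach the engine
        for task in tasks:
            task.cancel()  # closing the connection should abort it server-side
        await asyncio.gather(*tasks, return_exceptions=True)

        # The server should still respond promptly after all the aborts.
        response = await client.post("/v1/completions",
                                     json={**payload, "max_tokens": 5})
        assert response.status_code == 200


asyncio.run(main())
```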
Even though it does have the potential to be a footgun (as you documented), I like the usage of @with_cancellation, so LGTM pending green!
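For context, applying the decorator to a route looks roughly like this; the route path and handler name are illustrative, and this assumes a `with_cancellation` decorator like the one sketched at the end of this page:

```python
from fastapi import APIRouter, Request

router = APIRouter()


# The decorator sits below the route decorator so FastAPI registers the
# wrapped, cancellable function. The handler must accept the raw request
# so the wrapper can listen for the disconnect message.
@router.post("/v1/completions")
@with_cancellation
async def create_completion(raw_request: Request):
    ...  # long-running engine work happens here
```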
Ah shoot, the test was too much for the smaller cards we run on in CI, and it failed even though the logs do show plenty of requests being aborted :( I'll dial it back some. I don't want this to be flaky, so I think fewer requests with more tokens each would be better at piling up load without burdening the server with handling so many aborts.
See discussion and alternate solution in #11096
I came to agree with @jakkdl in encode/starlette#2094 and the linked issues that polling for disconnects with `request.is_disconnected()` introduces more problems than it's worth. Instead, we can use the pattern that's already in `StreamingResponse` and have a separate async task wait for a disconnect message, cancelling our work if one is received. The key here is that a `StreamingResponse` is able to safely consume all new messages because the request body has already been read. Our request handlers have the same guarantee, since fastapi first reads and parses the request and builds a pydantic object for us before invoking our handler.

This PR implements a decorator for our fastapi handlers that will cancel them if a disconnect message is received while they are running; a sketch of the pattern follows at the end of this description. This is implemented with asyncio directly instead of with anyio, because the rest of the code base assumes asyncio.

The advantages here are: …
Disadvantages:
- This does not work for the handler in `entrypoints/api_server.py`, because that handler reads the request body itself, so our one cancellation test in `tests/async_engine/test_api_server.py` fails. I'll need to write a new one, but it's 6pm on a Friday 🙃

I manually verified that this works to cancel both streaming and non-streaming requests; let me know what y'all think of doing this instead.
FIX #10087
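For reference, a minimal sketch of the decorator pattern described above, assuming the handler receives the raw starlette `Request` as a `raw_request` keyword argument; the names here are illustrative and the PR's actual implementation may differ.

```python
import asyncio
import functools

from fastapi import Request


async def listen_for_disconnect(request: Request) -> None:
    """Return once the client sends an http.disconnect message.

    Consuming messages here is safe only because the request body was
    already read (fastapi parsed it into a pydantic model before the
    handler was invoked).
    """
    while True:
        message = await request.receive()
        if message["type"] == "http.disconnect":
            return


def with_cancellation(handler):
    """Run the handler and a disconnect listener concurrently;
    whichever finishes first cancels the other."""

    @functools.wraps(handler)
    async def wrapper(*args, raw_request: Request, **kwargs):
        handler_task = asyncio.create_task(
            handler(*args, raw_request=raw_request, **kwargs))
        cancel_task = asyncio.create_task(listen_for_disconnect(raw_request))

        done, pending = await asyncio.wait(
            {handler_task, cancel_task},
            return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()

        if handler_task in done:
            return handler_task.result()
        # The client disconnected: the handler was cancelled and there is
        # no one left to receive a response.
        return None

    return wrapper
```

The footgun mentioned in review is visible in the sketch: the wrapped handler can be cancelled at any `await` point, so it must be safe to abandon mid-flight.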