
[Bugfix] Fix request cancellation without polling #11190

Merged
merged 9 commits into vllm-project:main
Dec 17, 2024

Conversation

joerunde
Collaborator

@joerunde joerunde commented Dec 14, 2024

See discussion and alternate solution in #11096

I came to agree with @jakkdl in encode/starlette#2094 and the linked issues that polling for disconnects with request.is_disconnected() introduces more problems than it's worth. Instead, we can use the pattern that's already in StreamingResponse and have a separate async task wait for a disconnect message, cancelling our work if one is received. The key here is that a StreamingResponse is able to safely consume all new messages because the request body has already been read. Our request handlers have the same guarantee, since FastAPI first reads and parses the request and builds a pydantic object for us before invoking our handler.

This PR implements a decorator for our FastAPI handlers that will cancel them if a disconnect message is received while they are running. This is implemented with asyncio directly instead of with anyio, because the rest of the codebase assumes asyncio.
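A minimal sketch of the idea (my illustration, not the exact code from this PR): race the handler against a task that waits for an "http.disconnect" ASGI message, and cancel the handler if the disconnect arrives first. The `raw_request` keyword and the 499 response status are assumptions for this sketch.

```python
import asyncio
import functools

from fastapi import Request, Response


async def _wait_for_disconnect(request: Request) -> None:
    """Block until the client disconnects.

    Safe only because FastAPI has already read and parsed the request body;
    otherwise this loop would steal body chunks from the handler.
    """
    while True:
        message = await request.receive()
        if message["type"] == "http.disconnect":
            return


def with_cancellation(handler):
    """Cancel the wrapped handler if the client disconnects mid-request."""

    @functools.wraps(handler)
    async def wrapper(*args, **kwargs):
        # Hypothetical convention for this sketch: the handler takes the raw
        # Request as a keyword argument so we can watch its receive channel.
        request: Request = kwargs["raw_request"]

        handler_task = asyncio.create_task(handler(*args, **kwargs))
        disconnect_task = asyncio.create_task(_wait_for_disconnect(request))
        try:
            done, _ = await asyncio.wait(
                {handler_task, disconnect_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if handler_task in done:
                return handler_task.result()
            # The client went away first: cancel the in-flight work and
            # return an nginx-style "client closed request" status.
            handler_task.cancel()
            return Response(status_code=499)
        finally:
            disconnect_task.cancel()

    return wrapper
```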

The advantages here are:

  • No extra polling overhead, and no need to configure timeouts and poll intervals
  • Work can be cancelled as soon as a disconnect happens, instead of at the next disconnect check
  • This is completely transparent to our code, and we can remove the existing cancellation handling that's coupled inside the server logic

Disadvantages

  • This doesn't work with entrypoints/api_server.py because that handler reads the request body itself, so our one cancellation test in tests/async_engine/test_api_server.py fails. I'll need to write a new one, but it's 6pm on a Friday 🙃

I manually verified that this works to cancel both streaming and non-streaming requests; let me know what y'all think of doing this instead.

FIX #10087


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the frontend label Dec 14, 2024
@DarkLight1337 DarkLight1337 requested a review from njhill December 14, 2024 02:20
@joerunde
Collaborator Author

Turns out the old async LLM engine needed to check for cancellation errors for this new method to work, so I shuffled the base api_server.py around to hook up the new decorator, and our test with it + the async LLM engine works again.
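For illustration, wiring the decorator onto a route could look roughly like this. This is a hypothetical sketch (the route path, stand-in request model, and handler name are placeholders, and it reuses the with_cancellation sketch above), not the actual api_server.py code:

```python
from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    """Stand-in for the real pydantic request model."""
    prompt: str


@app.post("/v1/completions")
@with_cancellation  # the decorator sketched earlier
async def create_completion(request: CompletionRequest, raw_request: Request):
    # FastAPI has already parsed the body into `request`, so the wrapper's
    # receive() loop will only ever see http.disconnect messages.
    ...
```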

I'm working now on a test for the OpenAI API server. That doesn't necessarily need to block merging if we're on a tight schedule, but it should be in soon 🤞

@mgoin @simon-mo Can I get a ready on this to get the full CI suite running?

@joerunde
Collaborator Author

Okay, the test for request cancellation with the OpenAI server is in as well. It overloads the server with a ton of requests and then cancels them, ensuring the server can still respond afterward. I would rather have been able to do something like check the /metrics output for aborted requests, but those aren't currently tracked by our metrics.
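The test idea, roughly, as a hedged sketch (not the actual tests/entrypoints/openai/test_basic.py code; the server_url fixture, model name, request count, and sleep are placeholders, and it assumes pytest-asyncio and httpx):

```python
import asyncio

import httpx
import pytest


@pytest.mark.asyncio
async def test_server_survives_cancelled_requests(server_url: str):
    payload = {
        "model": "placeholder-model",
        "prompt": "Hello, world",
        "max_tokens": 512,
    }

    async with httpx.AsyncClient(base_url=server_url, timeout=60) as client:
        # Pile up a bunch of in-flight generation requests...
        tasks = [
            asyncio.create_task(client.post("/v1/completions", json=payload))
            for _ in range(50)
        ]
        await asyncio.sleep(1)

        # ...then cancel them all, which closes the underlying connections
        # and should abort the corresponding requests server-side.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

        # The server should still be healthy and able to answer new requests.
        response = await client.get("/health")
        assert response.status_code == 200
```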

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 16, 2024
mgoin
mgoin approved these changes Dec 16, 2024
Member

@mgoin mgoin left a comment


Even though it does have the potential to be a footgun (as you documented), I like the usage of @with_cancellation, so LGTM pending green!

tests/entrypoints/openai/test_basic.py (review comment, resolved)
@joerunde
Collaborator Author

Ah shoot, the test was too much for the smaller cards we run on in CI, and it failed even though the logs do show plenty of requests being aborted :(

I'll dial it back some. I don't want this to be flaky, so I think fewer requests with more tokens each would be better at piling up load without burdening the server with handling so many aborts.

@simon-mo simon-mo merged commit 2d1b9ba into vllm-project:main Dec 17, 2024
53 of 55 checks passed
SageMoore pushed a commit to neuralmagic/vllm that referenced this pull request Dec 19, 2024
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
Labels
frontend, ready
Development

Successfully merging this pull request may close these issues.

[Bug]: Requests aren't aborted when client disconnects if user provided a middleware