[Bugfix] GPU memory profiling should be per LLM instance #10498
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
The current changes implement the quickfix I suggested in #10451 (comment), but, during validation, I noticed another consequence. Even with this PR, the memory profiling will measure all Torch memory allocated by the Python process, not just the memory allocated for the instance of the model being profiled. In other words, it fixes multiple vLLM servers sharing the GPU, but not multiple LLMs within a single vLLM process. So support for that is still a TODO.
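For illustration, here is a minimal sketch (not the vLLM profiling code) of why a process-wide counter over-counts when two models share one process, and how a baseline delta would isolate a single instance. The toy `nn.Linear` modules stand in for models:

```python
import torch

# Illustrative sketch (not vLLM code): torch.cuda.memory_allocated() is a
# per-process counter, so profiling a model by reading it directly also
# counts the weights of any other model already loaded in the same process.
# Taking a delta against a baseline isolates one instance.

model_a = torch.nn.Linear(4096, 4096).cuda()    # stands in for a first LLM

baseline = torch.cuda.memory_allocated()        # already includes model_a
model_b = torch.nn.Linear(4096, 4096).cuda()    # the model being "profiled"
current = torch.cuda.memory_allocated()

print(f"per-instance delta: {current - baseline} bytes")  # model_b only
print(f"process-wide total: {current} bytes")             # model_a + model_b
```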
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Travis Johnson <[email protected]>
force-pushed from 9e8fda0 to e2788e5
By "multiple LLMs within a single vLLM instance" I mean creating mutliple I'm not sure if it is a common use-case, but it was tested in |
`gpu_memory_utilization` was intended to limit the total memory allocation for an instance of an `LLM`. An update to the memory profiling changed the meaning of this parameter to be a global limit on GPU memory allocation (see this comment). This PR reverts that change so that `gpu_memory_utilization` is once again a per-model-instance limit.

It also simplifies the memory profiling code, but removes some information from the "Memory profiling results" log message. I'm open to feedback on adding back more information.
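As a rough illustration of the per-instance semantics, here is a sketch under assumptions, not the actual profiling code; the helper name and the baseline snapshot are hypothetical:

```python
import torch

def kv_cache_budget(gpu_memory_utilization: float, baseline_bytes: int) -> int:
    # Hypothetical helper: bytes left for the KV cache under a
    # per-instance interpretation of gpu_memory_utilization.
    total = torch.cuda.get_device_properties(0).total_memory
    # Memory attributable to this instance alone: current Torch usage minus
    # a snapshot taken before this instance started loading.
    this_instance = torch.cuda.memory_allocated() - baseline_bytes
    # The instance may use at most total * gpu_memory_utilization overall;
    # whatever it has not already consumed can go to the KV cache.
    return int(total * gpu_memory_utilization) - this_instance
```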
Refer to #10451 for more background and discussion.
FIX #10451