
[Bugfix] GPU memory profiling should be per LLM instance #10498

Closed

Conversation

@tjohnson31415 (Contributor) commented on Nov 20, 2024

gpu_memory_utilization was intended to limit the total memory allocation for an instance of an LLM. An update to the memory profiling changed the meaning of this parameter to be a global limit on GPU memory allocation (see this comment).

This PR reverts that change so that gpu_memory_utilization is once again a per-model-instance limit.

It also simplifies the memory profiling code, but removes some information from the "Memory profiling results" log message. I'm open to feedback on adding back more information.

Refer to #10451 for more background and discussion.

FIX #10451
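
As an illustration of the per-instance semantics this PR restores (a hedged sketch, not code from the PR; the model name and fractions are made up for the example), two engines running in separate processes can each budget their own share of one GPU:

```python
# Hedged sketch: each process tells vLLM it may use ~45% of the device, so two
# engines can share one GPU. Under the "global" interpretation introduced by the
# earlier profiling change, the second engine would count the first engine's
# memory against its own budget instead of budgeting only its own allocations.
from vllm import LLM

# Process A
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)

# Process B (started separately on the same GPU) creates its own LLM with the
# same fraction; with per-instance semantics, its 0.45 applies only to the
# memory that this instance itself allocates.
```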


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@tjohnson31415 (Contributor, Author) commented

The current changes implement the quick fix I suggested in #10451 (comment), but during validation I noticed another consequence. Even with this PR, the profiling measures all Torch memory allocated by the Python process, not just the memory allocated for the instance of the model being profiled (see the sketch below). In other words, it fixes multiple vLLM servers sharing the GPU, but not multiple LLMs within a single vLLM instance. Support for that is still a TODO.
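
To illustrate the distinction (a rough sketch, not vLLM's actual profiling code): device-wide queries see memory held by any process on the GPU, while PyTorch's allocator statistics are scoped to the current process but still cover every model loaded in it.

```python
# Rough sketch of the two measurement scopes discussed above.
import torch

# Device-wide view: includes memory held by other processes on the same GPU.
free_bytes, total_bytes = torch.cuda.mem_get_info()
used_on_device = total_bytes - free_bytes

# Process-wide view: only this process's Torch allocations, but it still counts
# every model / LLM() created in this process, not just the one being profiled.
used_by_this_process = torch.cuda.memory_allocated()

print(f"Used on the device (all processes): {used_on_device / 2**30:.2f} GiB")
print(f"Torch-allocated in this process:    {used_by_this_process / 2**30:.2f} GiB")
```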


mergify bot commented Nov 21, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tjohnson31415.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 21, 2024
@tjohnson31415 changed the title from "[Bugfix] exclude other GPU memory from total_allocated_bytes" to "[Bugfix] GPU memory profiling should be per LLM instance" on Nov 21, 2024
@youkaichao (Member) commented

> multiple LLMs within a single vLLM instance

What is this?

@tjohnson31415 (Contributor, Author) commented

By "multiple LLMs within a single vLLM instance" I mean creating mutliple LLM()s within the same python/torch context. I should have called it a "single Python instance" instead of "single vLLM instance".

I'm not sure whether it is a common use case, but it was tested in test_lazy_outlines at the time the memory profiling changes were merged, i.e. the same as the current test but with the lines that delete the first LLM() commented out. A rough sketch of the scenario is below.
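
For reference, a minimal sketch of that scenario (the model name and parameters are illustrative; this is not the test's actual code):

```python
# Two LLM() objects created in the same Python process. Profiling for the
# second instance also sees the Torch memory already allocated by the first
# one, which is the remaining gap described above.
from vllm import LLM, SamplingParams

llm_a = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3)
# test_lazy_outlines originally deleted llm_a here; with that deletion
# commented out, both instances coexist in one process.
llm_b = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3)

outputs = llm_b.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```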

@tjohnson31415 deleted the fix-gpu-utilization branch on November 25, 2024 16:19
@tjohnson31415 restored the fix-gpu-utilization branch on November 25, 2024 16:20
Successfully merging this pull request may close these issues.

[Bug]: Breaking Change in gpu_memory_utilization Behavior in vLLM 0.6.4