
[Bugfix] GPU memory profiling should be per LLM instance #10498

Closed

Conversation

@tjohnson31415 (Contributor) commented on Nov 20, 2024

gpu_memory_utilization was intended to limit the total memory allocation for an instance of an LLM. An update to the memory profiling changed the meaning of this parameter to be a global limit on GPU memory allocation (see this comment).

This PR reverts that change so that gpu_memory_utilization is once again a per-model-instance limit.

It also simplifies the memory profiling code, but removes some information from the "Memory profiling results" log message. I'm open to feedback on adding back more information.

Refer to #10451 for more background and discussion.

FIX #10451
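
As an illustration of the per-instance semantics this PR restores (a hedged sketch, not code from the PR; the model name and fractions are made up for the example), two engines running in separate processes can each budget their own share of one GPU:

```python
# Hedged sketch: each process tells vLLM it may use ~45% of the device, so two
# engines can share one GPU. Under the "global" interpretation introduced by the
# earlier profiling change, the second engine would count the first engine's
# memory against its own budget instead of budgeting only its own allocations.
from vllm import LLM

# Process A
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.45)

# Process B (started separately on the same GPU) creates its own LLM with the
# same fraction; with per-instance semantics, its 0.45 applies only to the
# memory that this instance itself allocates.
```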


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@tjohnson31415 (Contributor, Author) commented

The current changes implement the quick fix I suggested in #10451 (comment), but during validation I noticed another consequence. Even with this PR, the profiling measures all Torch memory allocated by the Python process, not just the memory allocated for the instance of the model being profiled (see the sketch below). In other words, it fixes multiple vLLM servers sharing the GPU, but not multiple LLMs within a single vLLM instance. Support for that is still a TODO.
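
To illustrate the distinction (a rough sketch, not vLLM's actual profiling code): device-wide queries see memory held by any process on the GPU, while PyTorch's allocator statistics are scoped to the current process but still cover every model loaded in it.

```python
# Rough sketch of the two measurement scopes discussed above.
import torch

# Device-wide view: includes memory held by other processes on the same GPU.
free_bytes, total_bytes = torch.cuda.mem_get_info()
used_on_device = total_bytes - free_bytes

# Process-wide view: only this process's Torch allocations, but it still counts
# every model / LLM() created in this process, not just the one being profiled.
used_by_this_process = torch.cuda.memory_allocated()

print(f"Used on the device (all processes): {used_on_device / 2**30:.2f} GiB")
print(f"Torch-allocated in this process:    {used_by_this_process / 2**30:.2f} GiB")
```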


mergify bot commented Nov 21, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @tjohnson31415.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label on Nov 21, 2024
@tjohnson31415 changed the title from "[Bugfix] exclude other GPU memory from total_allocated_bytes" to "[Bugfix] GPU memory profiling should be per LLM instance" on Nov 21, 2024
@youkaichao (Member) commented

> multiple LLMs within a single vLLM instance

What is this?

@tjohnson31415 (Contributor, Author) commented

By "multiple LLMs within a single vLLM instance" I mean creating mutliple LLM()s within the same python/torch context. I should have called it a "single Python instance" instead of "single vLLM instance".

I'm not sure whether it is a common use case, but it was tested in test_lazy_outlines at the time the memory profiling changes were merged, i.e. the same as the current test but with the lines that delete the first LLM() commented out. A rough sketch of the scenario is below.
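
For reference, a minimal sketch of that scenario (the model name and parameters are illustrative; this is not the test's actual code):

```python
# Two LLM() objects created in the same Python process. Profiling for the
# second instance also sees the Torch memory already allocated by the first
# one, which is the remaining gap described above.
from vllm import LLM, SamplingParams

llm_a = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3)
# test_lazy_outlines originally deleted llm_a here; with that deletion
# commented out, both instances coexist in one process.
llm_b = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.3)

outputs = llm_b.generate(["Hello, my name is"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```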

@tjohnson31415 deleted the fix-gpu-utilization branch on November 25, 2024 16:19
@tjohnson31415 restored the fix-gpu-utilization branch on November 25, 2024 16:20
Successfully merging this pull request may close these issues.

[Bug]: Breaking Change in gpu_memory_utilization Behavior in vLLM 0.6.4