[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback #9996
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
I reverted the old changes and added this check in _assert_memory_footprint_increased_during_profiling instead. Our main issue was running in a non-controlled environment such as a cloud platform.
The new check mechanism should differentiate between cloud and regular platforms: if we have loaded the model weights but the reported memory shows no change whatsoever, we are likely in such an environment. From my research this happens in non-controlled environments like the cloud, but I'm not an expert in this area and would appreciate an expert's review here.
…lm-project#9232) Signed-off-by: Ahmed Mansy <[email protected]>
Signed-off-by: Ahmed Mansy <[email protected]>
Revert "[Core] Enhance memory profiling in determine_num_available_blocks (vllm-project#9232)" This reverts commit 581804f. Reason: The initial implementation introduced a heuristic fallback with 70% GPU memory utilization, which may not be optimal for all configurations. There is a risk of overriding user configurations silently, and the fallback may lead to sub-optimal performance or out-of-memory (OOM) errors. This revert allows us to take a more robust approach that avoids these issues and provides clearer behavior for users configuring GPU memory utilization. Signed-off-by: Ahmed Mansy <[email protected]>
Signed-off-by: Ahmed Mansy <[email protected]>
vllm/worker/worker.py
Outdated
# If we've loaded model weights but memory shows no change,
# we're likely in a restricted environment
model_loaded = hasattr(self.model_runner, 'model')
memory_is_static = memory_diff == 0
is_restricted_env = model_loaded and memory_is_static
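For context, a minimal sketch of how this flag is intended to gate the existing check; the surrounding names (free_gpu_memory, self.init_gpu_memory) and the exact error text are assumptions based on the error message quoted later in this thread, not the actual diff:

```python
import torch


def _assert_memory_footprint_increased_during_profiling(self):
    # How much free GPU memory is left after loading the model.
    free_gpu_memory, _ = torch.cuda.mem_get_info()
    memory_diff = self.init_gpu_memory - free_gpu_memory

    # If we've loaded model weights but memory shows no change,
    # we're likely in a restricted environment (e.g. a cloud platform
    # that does not expose accurate per-device memory accounting).
    model_loaded = hasattr(self.model_runner, 'model')
    memory_is_static = memory_diff == 0
    is_restricted_env = model_loaded and memory_is_static

    if is_restricted_env:
        # Skip the strict assertion rather than failing hard.
        return

    assert memory_diff > 0, (
        "Error in memory profiling. This happens when the GPU memory was "
        "not properly cleaned up before initializing the vLLM instance.")
```

With this gating, the assertion only fires when the memory reading actually moved in the wrong direction (e.g. more free memory than at init), which matches the "not properly cleaned up" error discussed below.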
I don't understand this part. In what situations will we have, or not have, self.model_runner.model when calling this function?
This part just checks whether the model has been loaded while the reported memory did not change, which indicates that the memory reading is static.
This is a more reliable indicator because:
We know the model must use memory.
If memory appears static despite the model being loaded, we are almost certainly in a restricted environment (like a cloud platform where we don't have memory-management access), so we let it pass the check instead of throwing errors.
It worked with my H100 instance in the cloud (so it passed the cloud test). I'm not sure how to test the other condition, the one that returns this error:
This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
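If it helps, one way to exercise that condition without special hardware is to fake the memory readings. A rough pytest sketch, assuming a hypothetical worker fixture that builds a Worker with the model already loaded (construction details elided) and assuming the check reads torch.cuda.mem_get_info directly:

```python
import torch


def test_profiling_detects_uncleaned_gpu_memory(monkeypatch, worker):
    # Pretend the GPU now reports *more* free memory than at init time,
    # i.e. the "memory was not properly cleaned up" situation.
    fake_free = worker.init_gpu_memory + (256 << 20)
    fake_total = 80 << 30

    # Make the profiling check see memory that never went down.
    monkeypatch.setattr(torch.cuda, "mem_get_info",
                        lambda *args, **kwargs: (fake_free, fake_total))

    # The restricted-environment check should NOT swallow this case,
    # because the diff is negative rather than exactly zero.
    try:
        worker._assert_memory_footprint_increased_during_profiling()
    except AssertionError as err:
        assert "not properly cleaned up" in str(err)
    else:
        raise AssertionError("expected the profiling assertion to fail")
```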
It would be clearer if a torch expert could check this out, since this issue is not a simple bug: it means we can't run vLLM on a cloud platform that doesn't give us memory-management access.
While I still don't fully understand why we use hasattr to check whether the model is loaded or not, your statement that "it means we can't run vLLM if we're on a cloud that doesn't give us memory management access" seems like a valid case to me. One solution we could offer is an optional argument that lets you specify the number of GPU blocks yourself. When it is specified, we bypass the profiling and assume you will have that many blocks to use, and you may OOM if you ask for too many blocks.
I get what you mean! I added a check for num_gpu_blocks_override and it's working pretty well now.
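For anyone following along, a rough sketch of what that bypass could look like; the attribute path self.cache_config.num_gpu_blocks_override follows the option named above, but the exact placement and the _profile_num_available_blocks helper are illustrative assumptions, not the committed diff:

```python
from typing import Tuple


def determine_num_available_blocks(self) -> Tuple[int, int]:
    # If the user explicitly specified how many GPU blocks to use, trust
    # that value and skip memory profiling entirely. This lets vLLM run
    # on platforms with restricted GPU memory accounting, at the risk of
    # OOM if the override is set too high.
    override = self.cache_config.num_gpu_blocks_override
    if override is not None:
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             self.get_cache_block_size_bytes())
        return override, num_cpu_blocks

    # Otherwise fall back to the normal profiling run.
    return self._profile_num_available_blocks()
```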
…GPU blocks override Signed-off-by: Ahmed Mansy <[email protected]>
Signed-off-by: Ahmed Mansy <[email protected]>
Why do we need to support this? If the profiling run fails, I think that's indeed a failure.
This pull request has merge conflicts that must be resolved before it can be merged.
PR Title
[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback
Description
Fixes #9232
This PR improves the reliability of the determine_num_available_blocks function by introducing error handling and a fallback mechanism for memory profiling. The main enhancement is a try-except block that catches any exceptions raised during the profiling process. These updates provide more robust memory handling in the vLLM engine, reducing the likelihood of memory-related issues.
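To make the described structure concrete, a minimal sketch of the error-handling shape (not the committed diff; the _profile_num_available_blocks helper and logger usage are illustrative assumptions, and the fallback defers to num_gpu_blocks_override rather than a fixed heuristic, as discussed above):

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)


def determine_num_available_blocks(self) -> Tuple[int, int]:
    try:
        # Normal path: run a forward pass with dummy inputs and measure
        # peak memory usage to size the KV cache.
        return self._profile_num_available_blocks()
    except Exception:
        logger.exception("Memory profiling failed.")
        # Fallback: honor an explicit user-provided block count if set.
        override = self.cache_config.num_gpu_blocks_override
        if override is not None:
            return override, 0
        # Without an override there is nothing safe to fall back to.
        raise
```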
Changes
Added a try-except block to determine_num_available_blocks to gracefully handle errors during memory profiling.
Checklist
PR title uses the [Core] prefix.
Code formatted (format.sh).
docs/source/ updated for any modified user-facing behaviors.
Additional Notes
This PR enhances the robustness of vLLM's memory profiling by ensuring that determine_num_available_blocks can handle profiling errors gracefully. The fallback mechanism offers a reliable default for GPU memory allocation, enhancing the engine's stability, especially in constrained memory environments.
Let me know if further modifications are needed.