
[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback #9996

Open · Ahmed14z wants to merge 8 commits into base: main

Conversation


@Ahmed14z Ahmed14z commented Nov 4, 2024

PR Title

[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback

Description

Fixes #9232

This PR improves the reliability of the determine_num_available_blocks function by introducing error handling and a fallback mechanism for memory profiling. The main enhancements include:

  • Error Handling: The memory profiling code is wrapped in a try-except block to catch any exceptions that occur during the profiling process.
  • Fallback Calculation: A fallback mechanism calculates a safe number of GPU blocks using 70% of the total GPU memory if profiling encounters an error. This ensures functionality continues even if profiling fails.

These updates provide more robust memory handling in the vLLM engine, reducing the likelihood of memory-related issues.
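
For illustration, here is a minimal sketch of the fallback idea described above (not the exact diff). The helpers `profile_peak_memory` and `cache_block_size_bytes` are hypothetical placeholders standing in for the profiling step and the per-block size used inside the real `determine_num_available_blocks`:

```python
# Hedged sketch of the try-except fallback; helper names are placeholders,
# not vLLM's actual internals.
import torch


def num_gpu_blocks_with_fallback(profile_peak_memory,
                                 cache_block_size_bytes: int,
                                 gpu_memory_utilization: float = 0.9,
                                 fallback_utilization: float = 0.7) -> int:
    total = torch.cuda.get_device_properties(0).total_memory
    try:
        # Normal path: run the profiling forward pass and size the KV cache
        # from the memory left over after the model's peak usage.
        peak = profile_peak_memory()
        usable = total * gpu_memory_utilization - peak
    except Exception:
        # Fallback path: if profiling fails, assume a conservative 70% of
        # total GPU memory is available for the KV cache.
        usable = total * fallback_utilization
    return max(int(usable) // cache_block_size_bytes, 0)
```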

Changes

  • Added a try-except block to determine_num_available_blocks to gracefully handle errors during memory profiling.
  • Introduced a fallback calculation for GPU blocks, which provides a safe default allocation in the event of a profiling error.

Checklist

  • PR Title follows convention with [Core] prefix.
  • Code follows the Google Python style guide.
  • Passes all linter checks (using format.sh).
  • Relevant tests added or updated.
  • Documentation updated in docs/source/ for any modified user-facing behaviors.

Additional Notes

This PR enhances the robustness of vLLM's memory profiling by ensuring that determine_num_available_blocks can handle profiling errors gracefully. The fallback mechanism offers a reliable default for GPU memory allocation, enhancing the engine's stability, especially in constrained memory environments.

Let me know if further modifications are needed.


github-actions bot commented Nov 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Review thread on vllm/worker/worker.py (outdated, resolved)
@Ahmed14z Ahmed14z (Author) left a comment:


I reverted the old changes and instead added a check in _assert_memory_footprint_increased_during_profiling. Our main issue was running in a non-controlled environment, such as a cloud platform.

The new check mechanism should differentiate between cloud and normal platforms: if we've loaded the model weights but the reported memory shows no change whatsoever, then (from my research) we are likely in a non-controlled environment like a cloud. I'm no expert in this area, though, and would appreciate an expert's review.


Revert "[Core] Enhance memory profiling in determine_num_available_blocks (vllm-project#9232)"

This reverts commit 581804f.

Reason: The initial implementation introduced a heuristic fallback with 70% GPU memory utilization, which may not be optimal for all configurations. There is a risk of overriding user configurations silently, and the fallback may lead to sub-optimal performance or out-of-memory (OOM) errors. This revert allows us to take a more robust approach that avoids these issues and provides clearer behavior for users configuring GPU memory utilization.
Signed-off-by: Ahmed Mansy <[email protected]>
Comment on lines 261 to 266
# If we've loaded model weights but memory shows no change,
# we're likely in a restricted environment
model_loaded = hasattr(self.model_runner, 'model')
memory_is_static = memory_diff == 0

is_restricted_env = model_loaded and memory_is_static
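
For context, a hedged paraphrase of how this fragment might sit inside Worker._assert_memory_footprint_increased_during_profiling; the surrounding method body is reconstructed for illustration, not quoted from the diff:

```python
import torch

def _assert_memory_footprint_increased_during_profiling(self):
    free_gpu_memory, _ = torch.cuda.mem_get_info()
    memory_diff = self.init_gpu_memory - free_gpu_memory

    # If we've loaded model weights but memory shows no change,
    # we're likely in a restricted environment (e.g. a managed cloud host
    # that hides per-process GPU memory accounting).
    model_loaded = hasattr(self.model_runner, 'model')
    memory_is_static = memory_diff == 0
    is_restricted_env = model_loaded and memory_is_static

    assert memory_diff > 0 or is_restricted_env, (
        "Error in memory profiling. This happens when the GPU memory was "
        "not properly cleaned up before initializing the vLLM instance.")
```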
Collaborator:

I don't understand this part. In what situations will we, and will we not, have self.model_runner.model when calling this function?

Author:

This part just checks whether the model has been loaded while the reported memory did not change, which indicates that memory is static.

This is a more reliable indicator because:

  • We know the model must use memory.
  • If memory appears static despite the model being loaded, it's definitely a restricted environment (like a cloud host where we don't have memory-management access), so the check lets the assertion pass instead of throwing an error.

It worked on my H100 instance in the cloud, so it passes the cloud case. I'm not sure how to test the other condition, the one that returns this error:

This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
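
As a side note, one way to rule that condition in or out is to check free GPU memory right before constructing the engine. This is just a generic PyTorch check, not part of this PR:

```python
import torch

free, total = torch.cuda.mem_get_info()
print(f"Free GPU memory before vLLM init: {free / 1e9:.2f} / {total / 1e9:.2f} GB")
# If another process (or a previous run) still holds memory, `free` will be
# far below `total`, and profiling may hit the error quoted above.
```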

@Ahmed14z Ahmed14z (Author) commented on Nov 4, 2024:

It would help to have a torch expert look at this, since this is not a simple bug: it means we can't run vLLM on a cloud platform that doesn't give us memory-management access.

Collaborator:

While I still don't fully understand why we use hasattr to check whether the model is loaded or not, your statement that "it means we can't run vLLM if we're on a cloud that doesn't give us memory management access" does seem like a real case to me. One solution we could offer is an optional argument that lets you specify the number of GPU blocks yourself. When it is specified, we bypass profiling and assume you will have that many blocks to use, which may OOM if you ask for too many blocks.
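
A hedged sketch of that bypass, written as a Worker method. vLLM's CacheConfig already exposes num_gpu_blocks_override, but the exact placement of this check and the helper _profile_num_available_blocks are assumptions for illustration:

```python
def determine_num_available_blocks(self):
    # If the user pinned the number of GPU blocks, trust it and skip
    # profiling entirely. This may OOM if the override is too large.
    override = self.cache_config.num_gpu_blocks_override
    if override is not None:
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             self.get_cache_block_size_bytes())
        return override, num_cpu_blocks

    # Otherwise fall through to the normal profiling-based estimate
    # (placeholder for the existing implementation).
    return self._profile_num_available_blocks()
```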

Author:

I get what you mean! I added a check for num_gpu_blocks_override and it's working pretty well now.

@Ahmed14z Ahmed14z requested a review from comaniac November 4, 2024 22:28
@youkaichao (Member) commented:

“This ensures functionality continues even if profiling fails.”

Why do we need to support this? If the profiling run fails, I think that is indeed a failure.


mergify bot commented Nov 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ahmed14z.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 20, 2024