
[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback #9996

Open · Ahmed14z wants to merge 8 commits into base: main

Conversation


@Ahmed14z Ahmed14z commented Nov 4, 2024

PR Title

[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback

Description

Fixes #9232

This PR improves the reliability of the determine_num_available_blocks function by introducing error handling and a fallback mechanism for memory profiling. The main enhancements include:

  • Error Handling: The memory profiling code is wrapped in a try-except block to catch any exceptions that occur during the profiling process.
  • Fallback Calculation: A fallback mechanism calculates a safe number of GPU blocks using 70% of the total GPU memory if profiling encounters an error. This ensures functionality continues even if profiling fails.

These updates provide more robust memory handling in the vLLM engine, reducing the likelihood of memory-related issues.
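
For illustration, here is a minimal sketch of the fallback idea described above (not the exact diff). The helpers `profile_peak_memory` and `cache_block_size_bytes` are hypothetical placeholders standing in for the profiling step and the per-block size used inside the real `determine_num_available_blocks`:

```python
# Hedged sketch of the try-except fallback; helper names are placeholders,
# not vLLM's actual internals.
import torch


def num_gpu_blocks_with_fallback(profile_peak_memory,
                                 cache_block_size_bytes: int,
                                 gpu_memory_utilization: float = 0.9,
                                 fallback_utilization: float = 0.7) -> int:
    total = torch.cuda.get_device_properties(0).total_memory
    try:
        # Normal path: run the profiling forward pass and size the KV cache
        # from the memory left over after the model's peak usage.
        peak = profile_peak_memory()
        usable = total * gpu_memory_utilization - peak
    except Exception:
        # Fallback path: if profiling fails, assume a conservative 70% of
        # total GPU memory is available for the KV cache.
        usable = total * fallback_utilization
    return max(int(usable) // cache_block_size_bytes, 0)
```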

Changes

  • Added a try-except block to determine_num_available_blocks to gracefully handle errors during memory profiling.
  • Introduced a fallback calculation for GPU blocks, which provides a safe default allocation in the event of a profiling error.

Checklist

  • PR Title follows convention with [Core] prefix.
  • Code follows the Google Python style guide.
  • Passes all linter checks (using format.sh).
  • Relevant tests added or updated.
  • Documentation updated in docs/source/ for any modified user-facing behaviors.

Additional Notes

This PR enhances the robustness of vLLM's memory profiling by ensuring that determine_num_available_blocks can handle profiling errors gracefully. The fallback mechanism offers a reliable default for GPU memory allocation, enhancing the engine's stability, especially in constrained memory environments.

Let me know if further modifications are needed.


github-actions bot commented Nov 4, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

Review thread on vllm/worker/worker.py (outdated, resolved)
@Ahmed14z Ahmed14z (Author) left a comment:


I reverted the old changes and instead added a check in _assert_memory_footprint_increased_during_profiling. Our main issue was running in a non-controlled environment, such as a cloud platform.

The new check mechanism should differentiate between cloud and normal platforms: if we've loaded the model weights but the reported memory shows no change whatsoever, then (from my research) we are likely in a non-controlled environment like a cloud. I'm no expert in this area, though, and would appreciate an expert's review.


Revert "[Core] Enhance memory profiling in determine_num_available_blocks (vllm-project#9232)"

This reverts commit 581804f.

Reason: The initial implementation introduced a heuristic fallback with 70% GPU memory utilization, which may not be optimal for all configurations. There is a risk of overriding user configurations silently, and the fallback may lead to sub-optimal performance or out-of-memory (OOM) errors. This revert allows us to take a more robust approach that avoids these issues and provides clearer behavior for users configuring GPU memory utilization.
Signed-off-by: Ahmed Mansy <[email protected]>
Comment on lines 261 to 266
# If we've loaded model weights but memory shows no change,
# we're likely in a restricted environment
model_loaded = hasattr(self.model_runner, 'model')
memory_is_static = memory_diff == 0

is_restricted_env = model_loaded and memory_is_static
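
For context, a hedged paraphrase of how this fragment might sit inside Worker._assert_memory_footprint_increased_during_profiling; the surrounding method body is reconstructed for illustration, not quoted from the diff:

```python
import torch

def _assert_memory_footprint_increased_during_profiling(self):
    free_gpu_memory, _ = torch.cuda.mem_get_info()
    memory_diff = self.init_gpu_memory - free_gpu_memory

    # If we've loaded model weights but memory shows no change,
    # we're likely in a restricted environment (e.g. a managed cloud host
    # that hides per-process GPU memory accounting).
    model_loaded = hasattr(self.model_runner, 'model')
    memory_is_static = memory_diff == 0
    is_restricted_env = model_loaded and memory_is_static

    assert memory_diff > 0 or is_restricted_env, (
        "Error in memory profiling. This happens when the GPU memory was "
        "not properly cleaned up before initializing the vLLM instance.")
```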
Collaborator:

I don't understand this part. In what situations will we, and will we not, have self.model_runner.model when calling this function?

Author:

This part just checks whether the model has been loaded while the reported memory did not change, which indicates that memory is static.

This is a more reliable indicator because:

  • We know the model must use memory.
  • If memory appears static despite the model being loaded, it's definitely a restricted environment (like a cloud host where we don't have memory-management access), so the check lets the assertion pass instead of throwing an error.

It worked on my H100 instance in the cloud, so it passes the cloud case. I'm not sure how to test the other condition, the one that returns this error:

This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance.
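
As a side note, one way to rule that condition in or out is to check free GPU memory right before constructing the engine. This is just a generic PyTorch check, not part of this PR:

```python
import torch

free, total = torch.cuda.mem_get_info()
print(f"Free GPU memory before vLLM init: {free / 1e9:.2f} / {total / 1e9:.2f} GB")
# If another process (or a previous run) still holds memory, `free` will be
# far below `total`, and profiling may hit the error quoted above.
```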

@Ahmed14z Ahmed14z (Author) commented on Nov 4, 2024:

It would help to have a torch expert look at this, since this is not a simple bug: it means we can't run vLLM on a cloud platform that doesn't give us memory-management access.

Collaborator:

While I still don't fully understand why we use hasattr to check whether the model is loaded or not, your statement that "it means we can't run vLLM if we're on a cloud that doesn't give us memory management access" does seem like a real case to me. One solution we could offer is an optional argument that lets you specify the number of GPU blocks yourself. When it is specified, we bypass profiling and assume you will have that many blocks to use, which may OOM if you ask for too many blocks.
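
A hedged sketch of that bypass, written as a Worker method. vLLM's CacheConfig already exposes num_gpu_blocks_override, but the exact placement of this check and the helper _profile_num_available_blocks are assumptions for illustration:

```python
def determine_num_available_blocks(self):
    # If the user pinned the number of GPU blocks, trust it and skip
    # profiling entirely. This may OOM if the override is too large.
    override = self.cache_config.num_gpu_blocks_override
    if override is not None:
        num_cpu_blocks = int(self.cache_config.swap_space_bytes //
                             self.get_cache_block_size_bytes())
        return override, num_cpu_blocks

    # Otherwise fall through to the normal profiling-based estimate
    # (placeholder for the existing implementation).
    return self._profile_num_available_blocks()
```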

Author:

I get what you mean! I added a check for num_gpu_blocks_override and it's working pretty well now.

@Ahmed14z Ahmed14z requested a review from comaniac November 4, 2024 22:28
@youkaichao (Member) commented:

“This ensures functionality continues even if profiling fails.”

Why do we need to support this? If the profiling run fails, I think that is indeed a failure.


mergify bot commented Nov 20, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ahmed14z.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 20, 2024