[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback #9996

Open · wants to merge 8 commits into base: main
22 changes: 19 additions & 3 deletions vllm/worker/worker.py
@@ -255,9 +255,25 @@ def determine_num_available_blocks(self) -> Tuple[int, int]:
     def _assert_memory_footprint_increased_during_profiling(self):
         # NOTE(woosuk): Here we assume that the other processes using the same
         # GPU did not change their memory usage during the profiling.
-        free_gpu_memory, _ = torch.cuda.mem_get_info()
-        assert self.init_gpu_memory - free_gpu_memory > 0, (
-            "Error in memory profiling. "
+        free_gpu_memory, total_memory = torch.cuda.mem_get_info()
+        memory_diff = self.init_gpu_memory - free_gpu_memory
+
+        # If we've loaded model weights but memory shows no change,
+        # we're likely in a restricted environment.
+        model_loaded = hasattr(self.model_runner, 'model')
+        memory_is_static = memory_diff == 0
+
+        is_restricted_env = model_loaded and memory_is_static
Collaborator

I don't understand this part. In what situation will we have self.model_runner.model when calling this function, and in what situation will we not?

Author

This part checks whether the model has just been loaded while the reported memory did not change, which indicates that the memory readings are static.

This is a more reliable indicator because:

- We know the model must use memory.
- If memory appears static even though the model is loaded, we are in a restricted environment (for example, a cloud instance where we don't have memory-management access), so the check passes instead of raising an error.

It worked with my H100 on the cloud, so it passed the cloud test. I'm not sure how to test the other condition, the one that raises this error:

"This happens when the GPU memory was not properly cleaned up before initializing the vLLM instance."
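For context, here is a minimal standalone probe (not part of this PR or thread) that checks whether torch.cuda.mem_get_info() reacts to an allocation made by the current process, which is the assumption the profiling assert in worker.py relies on. It assumes a CUDA device is visible to the process:

```python
# Standalone probe: does mem_get_info() reflect this process's allocations?
import torch

free_before, total = torch.cuda.mem_get_info()

# Allocate roughly 1 GiB on the GPU (256M float32 elements).
x = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda")
torch.cuda.synchronize()

free_after, _ = torch.cuda.mem_get_info()
delta_gib = (free_before - free_after) / (1024 ** 3)

print(f"total={total / 1024**3:.2f} GiB, "
      f"free before={free_before / 1024**3:.2f} GiB, "
      f"free after={free_after / 1024**3:.2f} GiB, "
      f"delta={delta_gib:.2f} GiB")

# On a normal device the delta is roughly 1 GiB. If it stays near 0 even
# though the allocation succeeded, mem_get_info is not reflecting this
# process's usage, which matches the "restricted environment" case above.
del x
```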

Author (@Ahmed14z, Nov 4, 2024)

It would be clearer if a torch expert could check this out. This issue is not a simple bug: it means we can't run vLLM on a cloud that doesn't give us memory-management access.

Collaborator

While I still don't fully understand why we use hasattr to check whether the model is loaded, your point that we can't run vLLM on a cloud that doesn't give us memory-management access does seem like a real case to me. One solution would be to offer an optional argument that lets you specify the number of GPU blocks yourself. When it is specified, we bypass the profiling and assume you will have that many blocks to use, and you may OOM if you ask for too many.
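A rough sketch of what that bypass could look like in the worker. This is not code from this PR; it assumes worker.py's module-level torch import and logger, and that self.cache_config carries the num_gpu_blocks_override setting (the field behind vLLM's --num-gpu-blocks-override option):

```python
# Hypothetical sketch, not the change from this PR: skip the sanity check
# when the user has pinned the block count manually.
def _assert_memory_footprint_increased_during_profiling(self):
    if getattr(self.cache_config, "num_gpu_blocks_override", None) is not None:
        # The user told us how many blocks to use, so the profiling result is
        # unused; a static memory reading is not an error in this case.
        logger.info("num_gpu_blocks_override is set; skipping the memory "
                    "profiling sanity check.")
        return

    free_gpu_memory, _ = torch.cuda.mem_get_info()
    assert self.init_gpu_memory - free_gpu_memory > 0, (
        "Error in memory profiling. This happens when the GPU memory was "
        "not properly cleaned up before initializing the vLLM instance.")
```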

Author

I get what you mean! I added a check for num_gpu_blocks_override and it's working pretty well now.


+        if is_restricted_env:
+            logger.info("Detected restricted GPU environment. "
+                        "Model is loaded but memory reports static usage. "
+                        "Free memory: %.2fGB, Total memory: %.2fGB",
+                        free_gpu_memory / (1024**3),
+                        total_memory / (1024**3))
+
+        assert memory_diff > 0 or is_restricted_env, (
+            "Error in memory profiling. "
+            f"Initial free memory {self.init_gpu_memory}, current free memory"
+            f" {free_gpu_memory}. This happens when the GPU memory was "
+            "not properly cleaned up before initializing the vLLM instance.")