[Core] Enhance memory profiling in determine_num_available_blocks with error handling and fallback #9996
Open
Ahmed14z wants to merge 8 commits into vllm-project:main from Ahmed14z:fix-9232
Commits (8, all by Ahmed14z):
ed05f26 [Core] Enhance memory profiling in determine_num_available_blocks (#9…
e2fc283 [Core] Refactor memory profiling based on review feedback
be35fb8 Revert "[Core] Enhance memory profiling in determine_num_available_bl…
930a8eb [Lint] Refactor logging to remove f-string usage for compliance
18f25a3 [Lint] Adjust logging to use % formatting for compliance
86c41e9 [Core] Revert previous implementation and update worker to check for …
c1dea49 [Core] Revert previous implementation and update worker to check for …
1d976f2 [Lint] Adjust logging to be less than 80
I don't understand this part. In what situation will we, and will we not, have self.model_runner.model when calling this function?
This part just checks whether the model has just loaded while the reported memory did not change; that indicates the memory readings are static. This is a more reliable indicator because:
We know the model must use memory.
If memory appears static despite the model loading, it is definitely a restricted environment (like a cloud host where we don't have memory-management access), so the check lets the assessment pass instead of throwing errors.
It worked with my H100 on the cloud (so it passed the cloud test). I'm not sure how to test the other condition, the one that returns this error:
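The static-memory heuristic described in this comment could be sketched as follows. This is a minimal illustration under the commenter's stated assumptions, not vLLM's actual implementation; the function name profiling_looks_restricted and the mem_before_load/mem_after_load parameters are hypothetical stand-ins for readings such as torch.cuda.memory_allocated() taken before and after model load.

```python
def profiling_looks_restricted(model_loaded: bool,
                               mem_before_load: int,
                               mem_after_load: int) -> bool:
    """Return True when the environment likely hides GPU memory stats.

    A loaded model must occupy memory, so if the reported usage is
    unchanged after loading, the readings are static and the normal
    profiling in determine_num_available_blocks cannot be trusted.
    """
    return model_loaded and mem_after_load == mem_before_load


# A loaded model with unchanged readings points to a restricted host.
print(profiling_looks_restricted(True, 4096, 4096))   # True
# Memory grew after loading, so profiling should work normally.
print(profiling_looks_restricted(True, 4096, 9216))   # False
```

The point of conditioning on model_loaded is the one made above: before the model exists, unchanged memory is expected and proves nothing.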
It would be clearer if an expert in torch could check this, since this issue is not a simple bug: it means we can't run vLLM on a cloud that doesn't give us memory-management access.
While I still don't fully understand why we use hasattr to check whether the model is loaded, your statement that "it means we can't run vLLM on a cloud that doesn't give us memory-management access" seems like a real case to me. One solution would be to offer an optional argument that lets you specify the number of GPU blocks yourself. When it is specified, we bypass the profiling and assume you will have that many blocks to use, and you may OOM if you ask for too many blocks.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get what you mean! I added a check for num_gpu_blocks_override and it's working pretty well now.
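The bypass agreed on above could look roughly like this sketch. num_gpu_blocks_override is a real vLLM cache-config option, but the function decide_num_gpu_blocks and the profile_fn callback here are simplified, hypothetical stand-ins for the worker's actual profiling path.

```python
from typing import Callable, Optional


def decide_num_gpu_blocks(num_gpu_blocks_override: Optional[int],
                          profile_fn: Callable[[], int]) -> int:
    """Honor a user-supplied block count, otherwise profile.

    When num_gpu_blocks_override is set, memory profiling is skipped
    entirely and the user's number is trusted; asking for too many
    blocks can then OOM at runtime, as the reviewer notes.
    """
    if num_gpu_blocks_override is not None:
        return num_gpu_blocks_override
    return profile_fn()


# With the override set, profiling is bypassed even on restricted hosts.
print(decide_num_gpu_blocks(2048, lambda: 1234))  # 2048
# Without it, fall back to the normal profiling result.
print(decide_num_gpu_blocks(None, lambda: 1234))  # 1234
```

This keeps the default behavior unchanged while giving restricted cloud environments an escape hatch that never touches the unreliable memory readings.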