[VLM] Calculate maximum number of multi-modal tokens by model #6121
Conversation
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
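To make the decorator pattern above concrete, here is a minimal stand-in sketch of the registry mechanism, assuming the decorator accepts either a constant or a callable for the token cap. The `MultiModalRegistry` mock below is illustrative only and is not vLLM's actual implementation:

```python
class MultiModalRegistry:
    """Toy stand-in for vLLM's multi-modal registry (illustrative only)."""

    def __init__(self):
        self._max_image_tokens = {}

    def register_max_image_tokens(self, value):
        """Register a per-model cap, given as a constant or a zero-arg callable."""
        def wrapper(model_cls):
            self._max_image_tokens[model_cls] = value
            return model_cls
        return wrapper

    def get_max_image_tokens(self, model_cls):
        value = self._max_image_tokens[model_cls]
        # Resolve a callable lazily so the cap can depend on model config.
        return value() if callable(value) else value


MULTIMODAL_REGISTRY = MultiModalRegistry()


@MULTIMODAL_REGISTRY.register_max_image_tokens(lambda: 2048)
class MyVisionModel:
    pass
```

With this mock, `MULTIMODAL_REGISTRY.get_max_image_tokens(MyVisionModel)` resolves the registered callable to `2048` during profiling.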
IMO the relationship between register_max_tokens and register_dummy_data is a bit intricate. There needs to be a certain level of consistency between them, which is hard to get right. Should we mention something here?
I currently have a note in register_dummy_data that mentions it should use the max number of tokens from each modality. Is that sufficient?
IMO the two should be tied together for consistency - see my comment below in phi3v.py.
LGTM!
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
IMO the two should be tied together for consistency - see my comment below in phi3v.py.
@@ -321,6 +321,17 @@ def get_phi3v_image_feature_size(
+ (new_height // 336 + 1) * 12

def get_max_phi3v_image_tokens(ctx: InputContext):
get_max_phi3v_image_tokens and dummy_data_for_phi3v are both based on dummy_height, dummy_width = 8000, 50, so we should make these module-level constants in this file for consistency. I think this will suffice for now, and in the future we can establish a more structured protocol between the multi-modal feature size and the dummy data.
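The suggestion above can be sketched as follows. The constant values come from the review comment, and the feature-size function is a stand-in that reuses only the `(new_height // 336 + 1) * 12` term visible in the diff, not the full phi3v formula:

```python
# Hoist the shared dummy image size into module-level constants so the
# max-token calculation and the dummy-data generator cannot silently drift.
DUMMY_HEIGHT = 8000
DUMMY_WIDTH = 50


def _toy_feature_size(height: int, width: int) -> int:
    # Placeholder for get_phi3v_image_feature_size; uses only the one term
    # shown in the diff above, purely for illustration.
    return (height // 336 + 1) * 12


def get_max_image_tokens() -> int:
    return _toy_feature_size(DUMMY_HEIGHT, DUMMY_WIDTH)


def dummy_data_token_count() -> int:
    # Derived from the same constants, so it stays consistent with the
    # maximum reported during memory profiling.
    return _toy_feature_size(DUMMY_HEIGHT, DUMMY_WIDTH)
```

Because both functions read the same constants, changing the dummy image size in one place updates both the profiling cap and the dummy data together.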
I've made #6146 to address this.
…roject#6121) Signed-off-by: Alvant <[email protected]>
This PR further extends the multi-modal registry so that each model can specify its own maximum number of multi-modal tokens during memory profiling.
This replaces the function of the user-provided image_feature_size argument in the vision language config that was recently removed by #6089.
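To illustrate why a per-model maximum matters for memory profiling, here is a simplified sketch of how a profiler might budget context tokens once it can query the worst-case multi-modal token count. The function name and logic are hypothetical simplifications, not vLLM's actual profiling code:

```python
def profile_text_budget(max_model_len: int, max_mm_tokens: int,
                        num_images: int) -> int:
    """Tokens left for text after reserving worst-case multi-modal tokens.

    max_mm_tokens is the per-image maximum a model registers (previously
    supplied by the user via image_feature_size).
    """
    reserved = max_mm_tokens * num_images
    if reserved > max_model_len:
        raise ValueError("multi-modal tokens exceed the model context length")
    return max_model_len - reserved
```

For example, a 4096-token context with one image capped at 576 tokens leaves 3520 tokens for text during the profiling run.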