Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VLM] Calculate maximum number of multi-modal tokens by model #6121

Merged
merged 8 commits into from
Jul 4, 2024

Conversation

DarkLight1337
Copy link
Member

@DarkLight1337 DarkLight1337 commented Jul 3, 2024

This PR further extends the multi-modal registry so that each model can specify its own maximum number of multi-modal tokens during memory profiling.

This replaces the function of the user-provided image_feature_size argument in vision language config that was recently removed by #6089.

@DarkLight1337 DarkLight1337 requested a review from ywang96 July 3, 2024 23:21
vllm/multimodal/base.py Outdated Show resolved Hide resolved
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the relationship between register_max_tokens and register_dummy_data is a bit intricate. There needs to be certain level of consistency here. Hard to get right. Should we mention something here?

Copy link
Member Author

@DarkLight1337 DarkLight1337 Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently have a note in registry_dummy_data that mentions it should use the max number of tokens from each modality. Is that sufficient?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the two should be tied together for consistency - see my comment below in phi3v.py.

Copy link
Member

@ywang96 ywang96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the two should be tied together for consistency - see my comment below in phi3v.py.

@@ -321,6 +321,17 @@ def get_phi3v_image_feature_size(
+ (new_height // 336 + 1) * 12


def get_max_phi3v_image_tokens(ctx: InputContext):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_max_phi3v_image_tokens and dummy_data_for_phi3v are both based on dummy_height, dummy_width = 8000, 50, so we should make these constants to this file for consistency. I think this will suffice for the purpose of consistency for now, and in the future we can establish more structured protocol between multimodal feature size and dummy data.

Copy link
Member

@ywang96 ywang96 Jul 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made #6146 to address this.

@ywang96 ywang96 merged commit ae96ef8 into vllm-project:main Jul 4, 2024
70 checks passed
@DarkLight1337 DarkLight1337 deleted the max-tokens branch July 5, 2024 01:53
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants