[VLM] Calculate maximum number of multi-modal tokens by model #6121
Conversation
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
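To make the decorator pattern above concrete, here is a minimal stand-in sketch of the registry mechanism, assuming the decorator accepts either a constant or a callable for the token cap. The `MultiModalRegistry` mock below is illustrative only and is not vLLM's actual implementation:

```python
class MultiModalRegistry:
    """Toy stand-in for vLLM's multi-modal registry (illustrative only)."""

    def __init__(self):
        self._max_image_tokens = {}

    def register_max_image_tokens(self, value):
        """Register a per-model cap, given as a constant or a zero-arg callable."""
        def wrapper(model_cls):
            self._max_image_tokens[model_cls] = value
            return model_cls
        return wrapper

    def get_max_image_tokens(self, model_cls):
        value = self._max_image_tokens[model_cls]
        # Resolve a callable lazily so the cap can depend on model config.
        return value() if callable(value) else value


MULTIMODAL_REGISTRY = MultiModalRegistry()


@MULTIMODAL_REGISTRY.register_max_image_tokens(lambda: 2048)
class MyVisionModel:
    pass
```

With this mock, `MULTIMODAL_REGISTRY.get_max_image_tokens(MyVisionModel)` resolves the registered callable to `2048` during profiling.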
IMO the relationship between register_max_tokens and register_dummy_data is a bit intricate. There needs to be a certain level of consistency between them, which is hard to get right. Should we mention something here?
I currently have a note in register_dummy_data that mentions it should use the max number of tokens from each modality. Is that sufficient?
IMO the two should be tied together for consistency - see my comment below in phi3v.py.
LGTM!
from vllm.multimodal import MULTIMODAL_REGISTRY

@MULTIMODAL_REGISTRY.register_image_input_mapper()
+ @MULTIMODAL_REGISTRY.register_max_image_tokens(<your_calculation>)
IMO the two should be tied together for consistency - see my comment below in phi3v.py.
@@ -321,6 +321,17 @@ def get_phi3v_image_feature_size(
+ (new_height // 336 + 1) * 12

def get_max_phi3v_image_tokens(ctx: InputContext):
get_max_phi3v_image_tokens and dummy_data_for_phi3v are both based on dummy_height, dummy_width = 8000, 50, so we should make these module-level constants in this file for consistency. I think this will suffice for now, and in the future we can establish a more structured protocol between the multi-modal feature size and the dummy data.
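The suggestion above can be sketched as follows. The constant values come from the review comment, and the feature-size function is a stand-in that reuses only the `(new_height // 336 + 1) * 12` term visible in the diff, not the full phi3v formula:

```python
# Hoist the shared dummy image size into module-level constants so the
# max-token calculation and the dummy-data generator cannot silently drift.
DUMMY_HEIGHT = 8000
DUMMY_WIDTH = 50


def _toy_feature_size(height: int, width: int) -> int:
    # Placeholder for get_phi3v_image_feature_size; uses only the one term
    # shown in the diff above, purely for illustration.
    return (height // 336 + 1) * 12


def get_max_image_tokens() -> int:
    return _toy_feature_size(DUMMY_HEIGHT, DUMMY_WIDTH)


def dummy_data_token_count() -> int:
    # Derived from the same constants, so it stays consistent with the
    # maximum reported during memory profiling.
    return _toy_feature_size(DUMMY_HEIGHT, DUMMY_WIDTH)
```

Because both functions read the same constants, changing the dummy image size in one place updates both the profiling cap and the dummy data together.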
I've made #6146 to address this.
…roject#6121) Signed-off-by: Alvant <[email protected]>
This PR further extends the multi-modal registry so that each model can specify its own maximum number of multi-modal tokens during memory profiling.
This replaces the function of the user-provided image_feature_size argument in the vision language config that was recently removed by #6089.
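To illustrate why a per-model maximum matters for memory profiling, here is a simplified sketch of how a profiler might budget context tokens once it can query the worst-case multi-modal token count. The function name and logic are hypothetical simplifications, not vLLM's actual profiling code:

```python
def profile_text_budget(max_model_len: int, max_mm_tokens: int,
                        num_images: int) -> int:
    """Tokens left for text after reserving worst-case multi-modal tokens.

    max_mm_tokens is the per-image maximum a model registers (previously
    supplied by the user via image_feature_size).
    """
    reserved = max_mm_tokens * num_images
    if reserved > max_model_len:
        raise ValueError("multi-modal tokens exceed the model context length")
    return max_model_len - reserved
```

For example, a 4096-token context with one image capped at 576 tokens leaves 3520 tokens for text during the profiling run.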