
[V1] Refactor model executable interface for multimodal models #10570

Merged — 33 commits merged into vllm-project:main on Nov 26, 2024

Conversation

@ywang96 (Member) commented Nov 22, 2024

This PR refactors the interface of all multimodal language models for the V1 VLM re-architecture and torch.compile support. In particular, every multimodal model implementation in vLLM will need to meet the following requirements:

  • get_multimodal_embeddings(**kwargs) implemented in XYZModel or XYZForConditionalGeneration.
  • get_input_embeddings(input_ids, multimodal_embeddings) implemented in XYZModel or XYZForConditionalGeneration, producing the input embeddings that are passed to the language backbone.
  • Backward compatibility for V0 kept in XYZModel or XYZForConditionalGeneration until V0 is fully deprecated.
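The requirements above can be sketched roughly as follows. This is a hedged illustration only: the class name `XYZForConditionalGeneration`, the placeholder `IMAGE_TOKEN_ID`, and the method bodies are assumptions for demonstration, not vLLM's actual implementation.

```python
from typing import Optional

import torch
import torch.nn as nn


class XYZForConditionalGeneration(nn.Module):
    """Illustrative multimodal model following the refactored interface."""

    # Hypothetical placeholder token id marking image positions in input_ids.
    IMAGE_TOKEN_ID = 32

    def __init__(self, vocab_size: int = 100, hidden_size: int = 8) -> None:
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_multimodal_embeddings(self, **kwargs) -> Optional[torch.Tensor]:
        # A real model would run its vision encoder here; this stand-in
        # assumes pixel_values already has shape (num_image_tokens, hidden).
        return kwargs.get("pixel_values")

    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        inputs_embeds = self.embed_tokens(input_ids)
        if multimodal_embeddings is not None:
            # Merge multimodal embeddings into the placeholder positions so
            # the language backbone only ever sees embeddings, not raw ids.
            mask = input_ids == self.IMAGE_TOKEN_ID
            inputs_embeds = inputs_embeds.clone()
            inputs_embeds[mask] = multimodal_embeddings.to(inputs_embeds.dtype)
        return inputs_embeds
```

Separating the two steps this way is what enables torch.compile on the language backbone: the backbone always consumes embeddings, whether or not multimodal inputs are present.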

List of LMMs to be worked on

  • BLIP2
  • Chameleon
  • GLM
  • Fuyu
  • Intern-VL
  • Llava-next
  • Llava-next-video
  • Llava-one-vision
  • Molmo
  • Paligemma
  • Pixtral
  • Qwen2VL
  • Qwen2Audio
  • Ultravox

This PR is a prerequisite of applying #9871 to all multimodal models on vLLM.

Signed-off-by: Roger Wang <[email protected]>

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 DarkLight1337 self-assigned this Nov 23, 2024
@ywang96 (Member, Author) commented Nov 25, 2024

This PR is mostly done, with a few caveats:

  1. For MiniCPMV, multimodal embedding generation relies on the input ids, and it is not clear how it can be separated from input text embedding generation. We need to tag the model author to figure out how to separate them.
  2. For Molmo, the original implementation requires the final multimodal embedding length to match the input embedding length because of the (+) operation. I'll leave optimizing this to a later PR.
  3. This PR does not cover mllama.

All other models included in this PR have been tested with example scripts to ensure V0 compatibility. Full V1 compatibility will be addressed in the next PR.

@ywang96 ywang96 marked this pull request as ready for review November 25, 2024 03:01
@ywang96 ywang96 requested a review from WoosukKwon as a code owner November 25, 2024 03:01
@DarkLight1337 (Member) commented Nov 25, 2024

I'm getting this error for internvl tests (tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[intern_vl-test_case52]):

self = InternVLChatModel(
  (vision_model): InternVisionPatchModel(
    (embeddings): InternVisionEmbeddings(
      (patch_em...048, bias=True)
    (2): GELU(approximate='none')
    (3): Linear(in_features=2048, out_features=2048, bias=True)
  )
)
input_ids = tensor([92546, 92546, 92546,  ..., 92546, 92546, 92546], device='cuda:0')

    def _get_visual_token_mask(self, input_ids: torch.Tensor) -> torch.Tensor:
        if self.is_mono:
>           visual_token_mask = (
                input_ids == self.img_context_token_id).reshape(-1, 1)
E           AttributeError: 'bool' object has no attribute 'reshape'

vllm/model_executor/models/internvl.py:639: AttributeError

I think it's because self.img_context_token_id hasn't been set yet, so it resolves to None in the comparison expression.
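This failure mode can be reproduced in isolation: comparing a tensor against None yields a plain Python bool rather than a boolean tensor, so the chained `.reshape(-1, 1)` raises AttributeError. The guard below is a hypothetical illustration of the diagnosis, not necessarily the fix that was actually pushed.

```python
import torch

input_ids = torch.tensor([92546, 92546, 92546])

# self.img_context_token_id unset resolves to None, and tensor == None
# returns False (a bool), not a tensor — hence the AttributeError:
# 'bool' object has no attribute 'reshape'.
broken_mask = input_ids == None  # noqa: E711
assert isinstance(broken_mask, bool)

# Once the token id is actually set, the comparison returns a boolean
# tensor and the visual token mask can be built as intended.
img_context_token_id = 92546
visual_token_mask = (input_ids == img_context_token_id).reshape(-1, 1)
assert visual_token_mask.shape == (3, 1)
```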

mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ywang96.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 26, 2024
@mergify mergify bot removed the needs-rebase label Nov 26, 2024
@ywang96 (Member, Author) commented Nov 26, 2024

(Replying to @DarkLight1337's internvl test failure report above.)

I pushed a change which I think should be a clean fix for this. Let me know what you think!

DarkLight1337 and others added 4 commits November 27, 2024 00:51
@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 26, 2024
@ywang96 ywang96 enabled auto-merge (squash) November 26, 2024 17:42
@ywang96 ywang96 merged commit 2f0a0a1 into vllm-project:main Nov 26, 2024
60 of 63 checks passed
afeldman-nm pushed a commit to neuralmagic/vllm that referenced this pull request Dec 2, 2024
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024