
Commit b45e048

[Core] Dynamic image size support for VLMs (vllm-project#5276)

DarkLight1337, xwjiang2010, and ywang96 authored and committed.

Signed-off-by: Xiaowei Jiang <[email protected]>
Co-authored-by: Xiaowei Jiang <[email protected]>
Co-authored-by: ywang96 <[email protected]>
Co-authored-by: xwjiang2010 <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

1 parent 1f516b8 · commit b45e048

38 files changed: +1455 −666 lines

docs/source/dev/input_processing/model_inputs_index.rst
Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ Input Processing
 vLLM provides a mechanism for defining input processors for each model so that the inputs are processed
 in :class:`~vllm.LLMEngine` before they are passed to model executors.
 
-Currently, this mechanism is only utilized in **multi-modal models** for preprocessing multi-modal input
+Currently, this mechanism is only utilized in :ref:`multi-modal models <multi_modality>` for preprocessing multi-modal input
 data in addition to input prompt, but it can be extended to text-only language models when needed.
 
 Guides
docs/source/dev/multimodal/adding_multimodal_model.rst (new file)
Lines changed: 124 additions & 0 deletions

.. _adding_a_new_multimodal_model:

Adding a New Multimodal Model
=============================

This document provides a high-level guide on integrating a :ref:`multi-modal model <multi_modality>` into vLLM.

.. note::
    The complexity of adding a new model depends heavily on the model's architecture.
    The process is considerably easier if the model shares a similar architecture with an existing model in vLLM.
    However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

.. tip::
    If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
    We will be happy to help you out!


1. Set up the base vLLM model
-----------------------------

As usual, follow :ref:`these steps <adding_a_new_model>` to implement the model in vLLM, but note the following:

- You should additionally implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.

  .. code-block:: diff

      + from vllm.model_executor.models.interfaces import SupportsVision

      - class YourModelForImage2Seq(nn.Module):
      + class YourModelForImage2Seq(nn.Module, SupportsVision):

  .. note::
      The model class does not have to be named :code:`*ForCausalLM`.
      Check out `the HuggingFace Transformers documentation <https://huggingface.co/docs/transformers/model_doc/auto#multimodal>`__ for some examples.

- While implementing the :meth:`~torch.nn.Module.forward` method, reserve a keyword parameter
  for each input tensor that corresponds to a multi-modal input, as shown in the following example:

  .. code-block:: diff

        def forward(
            self,
            input_ids: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: List[torch.Tensor],
            attn_metadata: AttentionMetadata,
      +     pixel_values: torch.Tensor,
        ) -> SamplerOutput:


2. Register input mappers
-------------------------

For each modality type to support, decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in :meth:`~torch.nn.Module.forward`.

.. code-block:: diff

      from vllm.model_executor.models.interfaces import SupportsVision
    + from vllm.multimodal import MULTIMODAL_REGISTRY

    + @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
    + @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
      class YourModelForImage2Seq(nn.Module, SupportsVision):

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.

.. seealso::
    :ref:`input_processing_pipeline`


3. (Optional) Register dummy data
---------------------------------

During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

.. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsVision
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
      @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
    + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
      class YourModelForImage2Seq(nn.Module, SupportsVision):

Here are some examples:

- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
    :ref:`input_processing_pipeline`


4. (Optional) Register input processor
--------------------------------------

Sometimes, there is a need to process inputs at the :class:`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike implementations in HuggingFace Transformers, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's :meth:`~torch.nn.Module.forward` call.
You can register input processors via :meth:`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

.. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsVision
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
      @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
    + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
      class YourModelForImage2Seq(nn.Module, SupportsVision):

A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:

- Insert a static number of image tokens: `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Insert a dynamic number of image tokens: `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
    :ref:`input_processing_pipeline`
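
To see how the four steps above fit together, here is a minimal sketch of a fully decorated model class. Only the imports and decorators are taken from the guide itself; ``dummy_data_for_your_model`` and ``input_processor_for_your_model`` are hypothetical placeholders standing in for ``<your_dummy_data_factory>`` and ``<your_input_processor>``, whose exact signatures are defined by the input registry and are not shown in this commit.

.. code-block:: python

    from typing import List

    import torch
    from torch import nn

    from vllm.inputs import INPUT_REGISTRY
    from vllm.model_executor.models.interfaces import SupportsVision
    from vllm.multimodal import MULTIMODAL_REGISTRY


    def dummy_data_for_your_model(*args, **kwargs):
        # Hypothetical factory: return dummy inputs sized for the worst case
        # (the maximum image feature size) so that memory profiling is safe.
        raise NotImplementedError


    def input_processor_for_your_model(*args, **kwargs):
        # Hypothetical processor: e.g. expand a single `<image>` placeholder
        # into the number of tokens the image contributes to the prompt.
        raise NotImplementedError


    @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()  # default mapper
    @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()    # default mapper
    @INPUT_REGISTRY.register_dummy_data(dummy_data_for_your_model)
    @INPUT_REGISTRY.register_input_processor(input_processor_for_your_model)
    class YourModelForImage2Seq(nn.Module, SupportsVision):

        def forward(
            self,
            input_ids: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: List[torch.Tensor],
            attn_metadata,                   # vLLM attention metadata
            pixel_values: torch.Tensor,      # keyword reserved for image input
        ):
            ...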

docs/source/dev/multimodal/multimodal_index.rst
Lines changed: 15 additions & 3 deletions

@@ -1,3 +1,5 @@
+.. _multi_modality:
+
 Multi-Modality
 ==============
 
@@ -8,12 +10,18 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
 :class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
 which allows you to pass in multi-modal input alongside text and token prompts.
 
-By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model,
-you must decorate the model class with :meth:`InputRegistry.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`,
-as well as :meth:`MULTIMODAL_REGISTRY.register_input_mapper <MultiModalRegistry.register_input_mapper>` for each modality type to support.
+By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model <adding_a_new_multimodal_model>`.
 
 # TODO: Add more instructions on how to do that once embeddings is in.
 
+Guides
+++++++
+
+.. toctree::
+   :maxdepth: 1
+
+   adding_multimodal_model
+
 Module Contents
 +++++++++++++++
 
@@ -35,6 +43,10 @@ Base Classes
     :members:
     :show-inheritance:
 
+.. autoclass:: vllm.multimodal.MultiModalInputs
+    :members:
+    :show-inheritance:
+
 .. autoclass:: vllm.multimodal.MultiModalPlugin
     :members:
     :show-inheritance:
docs/source/models/vlm.rst
Lines changed: 16 additions & 8 deletions

@@ -23,7 +23,6 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
 Currently, the support for vision language models on vLLM has the following limitations:
 
 * Only single image input is supported per text prompt.
-* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.
 
 We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.
 
@@ -42,12 +41,17 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
     )
 
 .. important::
+    Currently, you have to specify ``image_feature_size`` to support memory profiling.
+    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
+    The calculation of feature size is specific to the model. For more details, please refer to
+    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
+
     We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
 
 
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
-* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
+* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
 * ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 
 .. note::
@@ -57,8 +61,8 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
 
 .. code-block:: python
 
-    prompt = "<image>" * 576 + (
-        "\nUSER: What is the content of this image?\nASSISTANT:")
+    # Refer to the HuggingFace repo for the correct format to use
+    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
 
     # Load the image using PIL.Image
     image = ...
@@ -74,8 +78,6 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
 
 A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
 
-.. important::
-    We will remove the need to format image tokens in a future release. Afterwards, the input text will follow the same format as that for the original HuggingFace model.
 
 Online OpenAI Vision API Compatible Inference
 ----------------------------------------------
@@ -103,6 +105,11 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
     --chat-template template_llava.jinja
 
 .. important::
+    Currently, you have to specify ``image_feature_size`` to support memory profiling.
+    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
+    The calculation of feature size is specific to the model. For more details, please refer to
+    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
+
     We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
 
 To consume the server, you can use the OpenAI client like in the example below:
@@ -121,6 +128,8 @@ To consume the server, you can use the OpenAI client like in the example below:
         messages=[{
             "role": "user",
             "content": [
+                # NOTE: The prompt formatting with the image token `<image>` is not needed
+                # since the prompt will be processed automatically by the API server.
                 {"type": "text", "text": "What's in this image?"},
                 {
                     "type": "image_url",
@@ -144,5 +153,4 @@ A full code example can be found in `examples/openai_vision_api_client.py <https
     export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
 
 .. note::
-    The prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server since the prompt will be
-    processed automatically by the server.
+    There is no need to format the prompt in the API request since it will be handled by the server.
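
For reference, here is a minimal offline-inference sketch that matches the updated prompt format above. The vision-specific engine arguments mirror ``examples/llava_example.py`` in this commit, and passing the image under an ``"image"`` key of ``multi_modal_data`` is an assumption about the :class:`vllm.multimodal.MultiModalDataDict` schema rather than something shown in this diff, so verify it against your vLLM version.

.. code-block:: python

    from PIL import Image

    from vllm import LLM

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        # Vision-specific arguments; the values below are illustrative and
        # mirror the LLaVA-1.5 example (576 is its static maximum feature size).
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    # The prompt follows the HuggingFace format; no manual repetition of
    # `<image>` tokens is needed anymore.
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
    image = Image.open("images/stop_sign.jpg")

    # Keying the image under "image" is an assumption about the
    # MultiModalDataDict schema; check the class docs for your version.
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })
    print(outputs[0].outputs[0].text)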

examples/llava_example.py
Lines changed: 1 addition & 2 deletions

@@ -17,8 +17,7 @@ def run_llava():
         image_feature_size=576,
     )
 
-    prompt = "<image>" * 576 + (
-        "\nUSER: What is the content of this image?\nASSISTANT:")
+    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
 
     image = Image.open("images/stop_sign.jpg")
 
examples/llava_next_example.py
Lines changed: 3 additions & 8 deletions

@@ -5,22 +5,17 @@
 
 from vllm import LLM, SamplingParams
 
-# Dynamic image input is currently not supported and therefore
-# a fixed image input shape and its corresponding feature size is required.
-# See https://github.com/vllm-project/vllm/pull/4199 for the complete
-# configuration matrix.
-
 
 def run_llava_next():
     llm = LLM(
         model="llava-hf/llava-v1.6-mistral-7b-hf",
         image_token_id=32000,
         image_input_shape="1,3,336,336",
-        image_feature_size=1176,
+        # Use the maximum possible value for memory profiling
+        image_feature_size=2928,
     )
 
-    prompt = "[INST] " + "<image>" * 1176 + (
-        "\nWhat is shown in this image? [/INST]")
+    prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
     url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
     image = Image.open(BytesIO(requests.get(url).content))
     sampling_params = SamplingParams(temperature=0.8,
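
As a cross-check on the new maximum used above, the 2928 figure can be reproduced from the LLaVA-NeXT "anyres" layout. This breakdown is an assumption based on the upstream LLaVA-NeXT design rather than something stated in this commit; the authoritative calculation is the :code:`get_<model_name>_image_feature_size` helper referenced in the docs above, located in ``vllm/model_executor/models/llava_next.py``.

.. code-block:: python

    # Assumed worst case for llava-hf/llava-v1.6-mistral-7b-hf with a 336x336
    # vision tower: a 672x672 input is split into a 2x2 grid of 336x336 tiles
    # in addition to the downsampled base image, and one "newline" feature is
    # appended per row of the tiled feature grid.
    features_per_side = 336 // 14              # 24 for CLIP ViT-L/14 at 336px
    base = features_per_side ** 2              # 576 features for the base image
    tiles = 4 * features_per_side ** 2         # 2304 features for the 2x2 tiles
    newlines = 2 * features_per_side           # 48 newline features, one per row
    assert base + tiles + newlines == 2928     # matches image_feature_size above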

examples/phi3v_example.py
Lines changed: 5 additions & 3 deletions

@@ -5,6 +5,9 @@
 
 from vllm import LLM, SamplingParams
 
+# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.
+# You can use `.buildkite/download-images.sh` to download them
+
 
 def run_phi3v():
     model_path = "microsoft/Phi-3-vision-128k-instruct"
@@ -18,16 +21,15 @@ def run_phi3v():
         trust_remote_code=True,
         image_token_id=32044,
         image_input_shape="1,3,1008,1344",
-        image_feature_size=1921,
+        # Use the maximum possible value for memory profiling
+        image_feature_size=2653,
         max_num_seqs=5,
     )
 
     image = Image.open("images/cherry_blossom.jpg")
 
     # single-image prompt
     prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n"  # noqa: E501
-    prompt = prompt.replace("<|image_1|>", "<|image|>" * 1921 + "<s>")
-
     sampling_params = SamplingParams(temperature=0, max_tokens=64)
 
     outputs = llm.generate(
