[vlm] Remove vision language config. #6089

Merged (11 commits, Jul 3, 2024)
39 changes: 14 additions & 25 deletions docs/source/models/vlm.rst
@@ -8,18 +8,6 @@ vLLM provides experimental support for Vision Language Models (VLMs). This docum
.. important::
We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.

Engine Arguments
----------------

The following :ref:`engine arguments <engine_args>` are specific to VLMs:

.. argparse::
:module: vllm.engine.arg_utils
:func: _vlm_engine_args_parser
:prog: -m vllm.entrypoints.openai.api_server
:nodefaultconst:

.. important::
Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
@@ -36,12 +24,14 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
We have removed all vision-language-related CLI args in the new release. This is a breaking change, so please update your code to follow
the above snippet.

Specifically, there is no longer any need to specify `image_feature_size` for profiling purposes. Internally, we construct the profiling
data structures for every model so that users never have to deal with this detail at the API layer.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
This work is still ongoing. In the meantime, we internally hardcode `image_feature_size = 3000` for every model to be conservative
in terms of GPU memory consumption. This hardcoded value will be removed and replaced with a more accurate profiling strategy.


To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
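The field descriptions that follow are collapsed in this diff view. As a rough, hedged sketch of the intended usage (the exact
prompt format and the ``multi_modal_data`` structure are assumptions for illustration, not part of this PR), passing an image
alongside a text prompt might look like:

.. code-block:: python

   from PIL import Image
   from vllm import LLM

   llm = LLM(model="llava-hf/llava-1.5-7b-hf")

   # Assumed input structure: prompt text plus image data in a single input dict.
   image = Image.open("images/cherry_blossom.jpg")
   outputs = llm.generate({
       "prompt": "USER: <image>\nWhat is shown in this image?\nASSISTANT:",
       "multi_modal_data": {"image": image},
   })
   for o in outputs:
       print(o.outputs[0].text)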
@@ -94,18 +84,17 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with

python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576 \
--chat-template template_llava.jinja

.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
We have removed all vision-language-related CLI args in the new release. This is a breaking change, so please update your code to follow
the above snippet.

Specifically, there is no longer any need to specify `image_feature_size` for profiling purposes. Internally, we construct the profiling
data structures for every model so that users never have to deal with this detail at the API layer.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
This work is still ongoing. In the meantime, we internally hardcode `image_feature_size = 3000` for every model to be conservative
in terms of GPU memory consumption. This hardcoded value will be removed and replaced with a more accurate profiling strategy.

To consume the server, you can use the OpenAI client like in the example below:
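The client snippet itself is collapsed in this diff view. A minimal sketch using the standard OpenAI Python client (the image URL
and prompt are illustrative assumptions; the vLLM server is OpenAI-compatible, so the usual chat-completions call applies):

.. code-block:: python

   from openai import OpenAI

   # The local vLLM server does not validate the API key by default.
   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

   chat_response = client.chat.completions.create(
       model="llava-hf/llava-1.5-7b-hf",
       messages=[{
           "role": "user",
           "content": [
               {"type": "text", "text": "What is in this image?"},
               {"type": "image_url",
                "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
           ],
       }],
   )
   print(chat_response.choices[0].message.content)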

3 changes: 2 additions & 1 deletion examples/phi3v_example.py
@@ -19,7 +19,8 @@ def run_phi3v():
llm = LLM(
model=model_path,
trust_remote_code=True,
max_num_seqs=1,
max_num_seqs=5,
max_model_len=4096,
)

image = Image.open("images/cherry_blossom.jpg")
6 changes: 0 additions & 6 deletions tests/entrypoints/openai/test_vision.py
@@ -39,12 +39,6 @@ def server(ray_ctx):
"--max-model-len",
"4096",
"--enforce-eager",
"--image-token-id",
"32000",
"--image-input-shape",
"1,3,336,336",
"--image-feature-size",
"576",
"--chat-template",
str(LLAVA_CHAT_TEMPLATE),
])
2 changes: 1 addition & 1 deletion tests/models/test_llava.py
@@ -51,7 +51,7 @@ def run_test(
hf_runner: Type[HfRunner],
vllm_runner: Type[VllmRunner],
image_assets: _ImageAssets,
model,
model: str,
*,
size_factors: List[float],
dtype: str,
2 changes: 1 addition & 1 deletion tests/models/test_llava_next.py
@@ -70,7 +70,7 @@ def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [5])
def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
def test_models(hf_runner, vllm_runner, image_assets, model: str, size_factors,
dtype: str, max_tokens: int, num_logprobs: int) -> None:
"""Inference result should be the same between hf and vllm.

12 changes: 6 additions & 6 deletions tests/models/test_phi3v.py
@@ -133,14 +133,14 @@ def run_test(
@pytest.mark.parametrize(
"size_factors",
[
# # No image
# [],
# No image
[],
# Single-scale
[1.0],
# # Single-scale, batched
# [1.0, 1.0, 1.0],
# # Multi-scale
# [0.25, 0.5, 1.0],
# Single-scale, batched
[1.0, 1.0, 1.0],
# Multi-scale
[0.25, 0.5, 1.0],
],
)
@pytest.mark.parametrize("dtype", [target_dtype])
11 changes: 5 additions & 6 deletions vllm/entrypoints/llm.py
@@ -121,12 +121,11 @@ def __init__(
) -> None:
if "disable_log_stats" not in kwargs:
kwargs["disable_log_stats"] = True
deprecated_vision_keys = [
"image_token_id", "image_feature_size", "image_input_shape",
"image_input_type"
]
if any(k in kwargs for k in deprecated_vision_keys):
raise TypeError("vision_language_config is deprecated. See ")
removed_vision_keys = ("image_token_id", "image_feature_size",
"image_input_shape", "image_input_type")
if any(k in kwargs for k in removed_vision_keys):
raise TypeError(
"There is no need to pass vision-related arguments anymore.")
engine_args = EngineArgs(
model=model,
tokenizer=tokenizer,
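The effect of this check, sketched briefly (caller-side illustration, not part of the diff): any of the removed keys now fails fast with a TypeError instead of being silently consumed.

    from vllm import LLM

    # Previously the vision config had to be spelled out by hand, e.g.
    #   LLM(model="llava-hf/llava-1.5-7b-hf", image_token_id=32000,
    #       image_input_shape="1,3,336,336", image_feature_size=576)
    # After this PR, passing any of the removed keys raises:
    #   TypeError: There is no need to pass vision-related arguments anymore.
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")  # vision config is inferred internally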
7 changes: 7 additions & 0 deletions vllm/multimodal/registry.py
@@ -120,3 +120,10 @@ def create_input_mapper(self, model_config: ModelConfig):
Create an input mapper (see :meth:`map_input`) for a specific model.
"""
return functools.partial(self.map_input, model_config)

def get_num_input_tokens(self):
"""
Get the number of input tokens for profiling purposes.
"""
# TODO: Provide this number on a per model basis.
return 3000
2 changes: 0 additions & 2 deletions vllm/worker/cpu_model_runner.py
@@ -366,8 +366,6 @@ def execute_model(
"attn_metadata": model_input.attn_metadata,
**(model_input.multi_modal_kwargs or {}),
}
if model_input.multi_modal_kwargs:
execute_model_kwargs.update(model_input.multi_modal_kwargs)

hidden_states = model_executable(**execute_model_kwargs)

2 changes: 0 additions & 2 deletions vllm/worker/embedding_model_runner.py
@@ -94,8 +94,6 @@ def execute_model(
"attn_metadata": model_input.attn_metadata,
**(model_input.multi_modal_kwargs or {}),
}
if model_input.multi_modal_kwargs:
execute_model_kwargs.update(model_input.multi_modal_kwargs)

hidden_states = model_executable(**execute_model_kwargs)

8 changes: 5 additions & 3 deletions vllm/worker/model_runner.py
@@ -810,10 +810,12 @@ def profile_run(self) -> None:
model_config = self.model_config

if supports_vision(self.model):
# TODO: properly inject these numbers from MultiModalRegistry.
# Right now, just use an overly conservative number.
max_num_seqs = max(
1, min(max_num_seqs, int(max_num_batched_tokens / 3000)))
1,
min(
max_num_seqs,
int(max_num_batched_tokens /
MULTIMODAL_REGISTRY.get_num_input_tokens())))
batch_size = 0
for group_id in range(max_num_seqs):
seq_len = (max_num_batched_tokens // max_num_seqs +
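For context on how this cap behaves, here is a small hedged illustration (the concrete numbers are made up for the example; only the capping expression mirrors the code above):

    max_num_batched_tokens = 16384
    max_num_seqs = 256
    num_input_tokens = 3000  # what MULTIMODAL_REGISTRY.get_num_input_tokens() currently returns

    # Same capping expression as in profile_run():
    capped = max(1, min(max_num_seqs, int(max_num_batched_tokens / num_input_tokens)))
    print(capped)  # 5 -> profiling builds at most 5 dummy multimodal sequences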
6 changes: 5 additions & 1 deletion vllm/worker/xpu_model_runner.py
@@ -171,7 +171,11 @@ def profile_run(self) -> None:
# TODO: properly inject these numbers from MultiModalRegistry.
# Right now, just use an overly conservative number.
max_num_seqs = max(
1, min(max_num_seqs, int(max_num_batched_tokens / 3000)))
1,
min(
max_num_seqs,
int(max_num_batched_tokens /
MULTIMODAL_REGISTRY.get_num_input_tokens())))

for group_id in range(max_num_seqs):
seq_len = (max_num_batched_tokens // max_num_seqs +