[vlm] Remove vision language config. (vllm-project#6089)
Signed-off-by: Xiaowei Jiang <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
2 people authored and robertgshaw2-neuralmagic committed Jul 7, 2024
1 parent add900f commit 255f3ed
Showing 43 changed files with 372 additions and 466 deletions.
5 changes: 5 additions & 0 deletions docs/source/dev/multimodal/multimodal_index.rst
@@ -10,8 +10,13 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
:class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
which allows you to pass in multi-modal input alongside text and token prompts.

.. note::
``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
:class:`vllm.multimodal.MULTIMODAL_REGISTRY`.

By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model. <adding_a_new_multimodal_model>`.


# TODO: Add more instructions on how to do that once embeddings is in.

Guides
78 changes: 39 additions & 39 deletions docs/source/models/vlm.rst
@@ -8,18 +8,6 @@ vLLM provides experimental support for Vision Language Models (VLMs). This docum
.. important::
We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.

Engine Arguments
----------------

The following :ref:`engine arguments <engine_args>` are specific to VLMs:

.. argparse::
:module: vllm.engine.arg_utils
:func: _vlm_engine_args_parser
:prog: -m vllm.entrypoints.openai.api_server
:nodefaultconst:

.. important::
Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
@@ -33,40 +21,33 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``

.. code-block:: python
llm = LLM(
model="llava-hf/llava-1.5-7b-hf",
image_token_id=32000,
image_input_shape="1,3,336,336",
image_feature_size=576,
)
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
We have removed all vision-language-related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
the above snippet. Specifically, ``image_feature_size`` no longer needs to be specified; internally, we now construct the data structures needed to
profile every model.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
:meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
with a more accurate profiling strategy in the future.


To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should follow the format that is documented on HuggingFace.
* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.

.. note::

``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
:class:`vllm.multimodal.MULTIMODAL_REGISTRY`.

.. code-block:: python
# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = ...
image = PIL.Image.open(...)
# Single prompt inference
outputs = llm.generate({
"prompt": prompt,
"multi_modal_data": {"image": image},
@@ -75,6 +56,26 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# Batch inference
image_1 = PIL.Image.open(...)
image_2 = PIL.Image.open(...)
outputs = llm.generate(
[
{
"prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_1},
},
{
"prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
"multi_modal_data": {"image": image_2},
}
]
)
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.

@@ -99,18 +100,17 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576 \
--chat-template template_llava.jinja
.. important::
Currently, you have to specify ``image_feature_size`` to support memory profiling.
To avoid OOM during runtime, you should set this to the maximum value supported by the model.
The calculation of feature size is specific to the model. For more details, please refer to
the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.

We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
We have removed all vision-language-related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
the above snippet. Specifically, ``image_feature_size`` no longer needs to be specified; internally, we now construct the data structures needed to
profile every model.

This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
:meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
with a more accurate profiling strategy in the future.

To consume the server, you can use the OpenAI client like in the example below:

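The client snippet referenced here is collapsed in this diff. As a rough sketch of what consuming the endpoint looks like (assuming the server launched above, the ``openai`` Python client, and the Big Ben image URL used elsewhere in this commit), something like:

.. code-block:: python

    from openai import OpenAI

    # The local server does not check the API key, but the client requires one.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
                    },
                },
            ],
        }],
    )
    print("Chat response:", chat_response.choices[0].message.content)
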
7 changes: 1 addition & 6 deletions examples/llava_example.py
@@ -10,12 +10,7 @@


def run_llava():
llm = LLM(
model="llava-hf/llava-1.5-7b-hf",
image_token_id=32000,
image_input_shape="1,3,336,336",
image_feature_size=576,
)
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

8 changes: 1 addition & 7 deletions examples/llava_next_example.py
@@ -7,13 +7,7 @@


def run_llava_next():
llm = LLM(
model="llava-hf/llava-v1.6-mistral-7b-hf",
image_token_id=32000,
image_input_shape="1,3,336,336",
# Use the maximum possible value for memory profiling
image_feature_size=2928,
)
llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)

prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
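The remainder of this example is collapsed in the diff. A hypothetical end-to-end sketch mirroring the hunk above (the actual example file may fetch and decode the image differently) could look like:

# Hypothetical continuation of run_llava_next(); image loading details are an assumption.
from io import BytesIO

import requests
from PIL import Image
from vllm import LLM

llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})
for o in outputs:
    print(o.outputs[0].text)
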
3 changes: 0 additions & 3 deletions examples/openai_vision_api_client.py
@@ -3,9 +3,6 @@
Launch the vLLM server with the following command:
python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576 \
--chat-template template_llava.jinja
"""
import base64
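The rest of this client is collapsed; the trailing ``import base64`` hints at sending a local image as a data URL. A minimal sketch of that encoding step (the file path here is hypothetical):

import base64

# Encode a local image (hypothetical path) and wrap it in a data URL; the
# data URL then takes the place of a plain HTTP image URL in the
# chat-completions request shown in the docs above.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
data_url = f"data:image/jpeg;base64,{image_b64}"
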
6 changes: 2 additions & 4 deletions examples/phi3v_example.py
@@ -14,15 +14,13 @@ def run_phi3v():

# Note: The default setting of max_num_seqs (256) and
# max_model_len (128k) for this model may cause OOM.
# You may lower either to run this example on lower-end GPUs.

# In this example, we override max_num_seqs to 5 while
# keeping the original context length of 128k.
llm = LLM(
model=model_path,
trust_remote_code=True,
image_token_id=32044,
image_input_shape="1,3,1008,1344",
# Use the maximum possible value for memory profiling
image_feature_size=2653,
max_num_seqs=5,
)

6 changes: 3 additions & 3 deletions tests/distributed/test_multimodal_broadcast.py
@@ -25,9 +25,9 @@
model = os.environ["TEST_DIST_MODEL"]

if model.startswith("llava-hf/llava"):
from ..models.test_llava import model_and_vl_config, run_test
from ..models.test_llava import models, run_test
elif model.startswith("microsoft/Phi-3-vision"):
from ..models.test_phi3v import model_and_vl_config, run_test
from ..models.test_phi3v import models, run_test
else:
raise NotImplementedError(f"Unsupported model: {model}")

@@ -49,7 +49,7 @@ def test_models(hf_runner, vllm_runner, image_assets,
hf_runner,
vllm_runner,
image_assets,
model_and_config=model_and_vl_config[0],
model=models[0],
size_factors=[1.0],
dtype=dtype,
max_tokens=max_tokens,
6 changes: 0 additions & 6 deletions tests/entrypoints/openai/test_vision.py
@@ -44,12 +44,6 @@ def server(ray_ctx):
"--max-model-len",
"4096",
"--enforce-eager",
"--image-token-id",
"32000",
"--image-input-shape",
"1,3,336,336",
"--image-feature-size",
"576",
"--chat-template",
str(LLAVA_CHAT_TEMPLATE),
])
60 changes: 17 additions & 43 deletions tests/models/test_llava.py
@@ -4,7 +4,6 @@
from transformers import AutoTokenizer

from tests.nm_utils.utils_skip import should_skip_test_group
from vllm.config import VisionLanguageConfig
from vllm.multimodal.utils import rescale_image_size
from vllm.sequence import SampleLogprobs

@@ -26,49 +25,27 @@
"USER: <image>\nWhat's in this image?\nASSISTANT:",
})

IMAGE_TOKEN_ID = 32000

def iter_llava_configs(model_name: str):
image_hw_to_feature_size = {
(336, 336): 576,
}

for (h, w), f in image_hw_to_feature_size.items():
input_shape = (1, 3, h, w)
yield (model_name,
VisionLanguageConfig(image_feature_size=f,
image_token_id=32000,
image_input_shape=input_shape))


model_and_vl_config = [
*iter_llava_configs("llava-hf/llava-1.5-7b-hf"),
]
models = ["llava-hf/llava-1.5-7b-hf"]


def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
Optional[SampleLogprobs]],
vlm_config: VisionLanguageConfig, model_id: str):
"""Sanitize vllm output to be comparable with hf output.
The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
It also reduces `output_str` from "<image><image>bla" to "bla".
"""
model: str):
"""Sanitize vllm output to be comparable with hf output."""
output_ids, output_str, out_logprobs = vllm_output
image_token_id = vlm_config.image_token_id

tokenizer = AutoTokenizer.from_pretrained(model_id)
image_token_str = tokenizer.decode(image_token_id)
tokenizer = AutoTokenizer.from_pretrained(model)
eos_token_id = tokenizer.eos_token_id

hf_output_ids = [
token_id for idx, token_id in enumerate(output_ids)
if token_id != image_token_id or output_ids[idx - 1] != image_token_id
if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
]

hf_output_str = output_str \
.replace(image_token_str * vlm_config.image_feature_size, "")
assert hf_output_str[0] == " "
hf_output_str = hf_output_str[1:]
assert output_str[0] == " "
hf_output_str = output_str[1:]
if hf_output_ids[-1] == eos_token_id:
hf_output_str = hf_output_str + tokenizer.decode(eos_token_id)

@@ -79,7 +56,7 @@ def run_test(
hf_runner: Type[HfRunner],
vllm_runner: Type[VllmRunner],
image_assets: _ImageAssets,
model_and_config: Tuple[str, VisionLanguageConfig],
model: str,
*,
size_factors: List[float],
dtype: str,
@@ -97,7 +74,6 @@ def run_test(
Note, the text input is also adjusted to abide by vllm contract.
The text output is sanitized to be able to compare with hf.
"""
model_id, vlm_config = model_and_config
images = [asset.pil_image for asset in image_assets]

inputs_per_image = [(
@@ -111,12 +87,11 @@
# will hurt multiprocessing backend with fork method (the default method).

# max_model_len should be greater than image_feature_size
with vllm_runner(model_id,
with vllm_runner(model,
dtype=dtype,
tensor_parallel_size=tensor_parallel_size,
distributed_executor_backend=distributed_executor_backend,
enforce_eager=True,
**vlm_config.as_cli_args_dict()) as vllm_model:
enforce_eager=True) as vllm_model:
vllm_outputs_per_image = [
vllm_model.generate_greedy_logprobs(prompts,
max_tokens,
@@ -125,7 +100,7 @@ def run_test(
for prompts, images in inputs_per_image
]

with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
with hf_runner(model, dtype=dtype, is_vision_model=True) as hf_model:
hf_outputs_per_image = [
hf_model.generate_greedy_logprobs_limit(prompts,
max_tokens,
@@ -141,15 +116,15 @@ def run_test(
check_logprobs_close(
outputs_0_lst=hf_outputs,
outputs_1_lst=[
vllm_to_hf_output(vllm_output, vlm_config, model_id)
vllm_to_hf_output(vllm_output, model)
for vllm_output in vllm_outputs
],
name_0="hf",
name_1="vllm",
)


@pytest.mark.parametrize("model_and_config", model_and_vl_config)
@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
"size_factors",
[
@@ -166,14 +141,13 @@ def run_test(
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [5])
def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
size_factors, dtype: str, max_tokens: int,
num_logprobs: int) -> None:
def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
dtype: str, max_tokens: int, num_logprobs: int) -> None:
run_test(
hf_runner,
vllm_runner,
image_assets,
model_and_config,
model,
size_factors=size_factors,
dtype=dtype,
max_tokens=max_tokens,
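The docstring that used to spell out the sanitization in ``vllm_to_hf_output`` was shortened by this commit; as a small worked illustration of the surviving list comprehension (token IDs other than 32000 are arbitrary examples), a leading run of repeated image tokens collapses to a single one:

IMAGE_TOKEN_ID = 32000

output_ids = [1, 32000, 32000, 32000, 415, 3469]
hf_output_ids = [
    token_id for idx, token_id in enumerate(output_ids)
    if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
]
assert hf_output_ids == [1, 32000, 415, 3469]
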