[Model][VLM] Add Qwen2-VL model support #7905

Merged
merged 44 commits into from
Sep 11, 2024
Changes from 39 commits
0a648b2
Add support to Qwen2-VL.
fyabc Aug 23, 2024
320df57
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Aug 26, 2024
7f96df8
Reformat
fyabc Aug 27, 2024
fbf2b8b
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Aug 27, 2024
bcaff4f
Update transformers link.
fyabc Aug 27, 2024
f2185bf
Bugfix of mrope_input_positions in model_runner.py.
fyabc Aug 27, 2024
60448cb
Rename pixel_values_video to pixel_values_videos in qwen2_vl.py.
fyabc Aug 27, 2024
71a77b1
Fix the bug of MultiModalInputs.batch() when passing different modali…
fyabc Aug 27, 2024
60c4cbd
Fix the bug when running OpenAI-compatible API server.
fyabc Aug 27, 2024
e29ff54
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Aug 29, 2024
ddb7138
Refactor qwen2_vl.py based on review comments.
fyabc Aug 29, 2024
14fe12a
reformat
fyabc Aug 29, 2024
89def23
reformat
fyabc Aug 29, 2024
e721e60
Fix the bug of model_is_mrope in model_runner.py.
fyabc Aug 29, 2024
d66d167
fix type hints in qwen2_vl.py
fyabc Aug 29, 2024
acd85ed
Update mm input processors according to new MultiModalInput.batch() i…
fyabc Aug 29, 2024
8d762c6
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Aug 30, 2024
87ba5ed
Fix SamplerOutput.
fyabc Aug 30, 2024
cda300a
Fix bug of quantization.
fyabc Aug 30, 2024
da03a3f
Bugfix of type hints in qwen2_vl.py.
fyabc Aug 31, 2024
25fb189
reformat.
fyabc Aug 31, 2024
d01530d
Merge branch 'main' into add_qwen2_vl_new
ywang96 Sep 1, 2024
faebfe4
fix typo from resolving conflict
ywang96 Sep 1, 2024
e492e53
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Sep 2, 2024
2e87db7
Bugfix in qwen2_vl.py.
fyabc Sep 2, 2024
39a1069
Adding xformers implementation
fyabc Sep 5, 2024
855c78b
Fix bug of attn_bias in xformers implementation
fyabc Sep 5, 2024
091983f
Fix bug in xformers implementation, and add backend check in vision a…
fyabc Sep 6, 2024
b406571
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Sep 6, 2024
7739588
Bugfix in qwen2_vl.py.
fyabc Sep 6, 2024
5bab9ba
Bugfix in qwen2_vl.py.
fyabc Sep 6, 2024
4587346
reformat.
fyabc Sep 6, 2024
ffad79f
Refactor MRotaryEmbedding.
fyabc Sep 6, 2024
9e7a946
Merge branch 'refs/heads/main' into add_qwen2_vl_new
fyabc Sep 9, 2024
d527417
Add "video" into ModalityStr.
fyabc Sep 9, 2024
6f3116c
Add Qwen2-VL examples.
fyabc Sep 9, 2024
386f302
Optimizer Qwen2-VL input processor. Update document.
fyabc Sep 10, 2024
c64c217
Update model notes and requirements-common.txt.
fyabc Sep 10, 2024
6bdefd6
Update model notes.
fyabc Sep 10, 2024
33dd048
Skip loading model
DarkLight1337 Sep 11, 2024
369ce7d
Merge branch 'main' into add_qwen2_vl_new
DarkLight1337 Sep 11, 2024
282c66a
format
DarkLight1337 Sep 11, 2024
14ef94d
Increase `max_model_len` to fit the original image
DarkLight1337 Sep 11, 2024
09b7a4f
Merge branch 'main' into add_qwen2_vl_new
DarkLight1337 Sep 11, 2024
9 changes: 9 additions & 0 deletions docs/source/models/supported_models.rst
@@ -247,6 +247,11 @@ Multimodal Language Models
     - Image\ :sup:`E`
     - :code:`Qwen/Qwen-VL`, :code:`Qwen/Qwen-VL-Chat`, etc.
     -
+  * - :code:`Qwen2VLForConditionalGeneration`
+    - Qwen2-VL (see note)
+    - Image\ :sup:`+` / Video\ :sup:`+`
+    - :code:`Qwen/Qwen2-VL-2B-Instruct`, :code:`Qwen/Qwen2-VL-7B-Instruct`, :code:`Qwen/Qwen2-VL-72B-Instruct`, etc.
+    -
   * - :code:`UltravoxModel`
     - Ultravox
     - Audio\ :sup:`E+`
@@ -260,6 +265,10 @@ Multimodal Language Models
   For :code:`openbmb/MiniCPM-V-2`, the official repo doesn't work yet, so we need to use a fork (:code:`HwwwH/MiniCPM-V-2`) for now.
   For more details, please see: https://github.com/vllm-project/vllm/pull/4087#issuecomment-2250397630

+.. note::
+  For :code:`Qwen2-VL`, the latest release of :code:`huggingface/transformers` doesn't work yet, so we need to use a developer version (:code:`21fac7abba2a37fae86106f87fcf9974fd1e3830`) for now.
+  For more details, please see: https://github.com/vllm-project/vllm/pull/7905#issuecomment-2339863055
+
 ----

 If your model uses one of the above model architectures, you can seamlessly run your model with vLLM.
18 changes: 18 additions & 0 deletions examples/offline_inference_vision_language.py
@@ -173,6 +173,23 @@ def run_qwen_vl(question):
     return llm, prompt, stop_token_ids


+# Qwen2-VL
+def run_qwen2_vl(question):
+    model_name = "Qwen/Qwen2-VL-7B-Instruct"
+
+    llm = LLM(
+        model=model_name,
+        max_num_seqs=5,
+    )
+
+    prompt = ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
+              "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
+              f"{question}<|im_end|>\n"
+              "<|im_start|>assistant\n")
+    stop_token_ids = None
+    return llm, prompt, stop_token_ids
+
+
 model_example_map = {
     "llava": run_llava,
     "llava-next": run_llava_next,
@@ -184,6 +201,7 @@ def run_qwen_vl(question):
     "blip-2": run_blip2,
     "internvl_chat": run_internvl,
     "qwen_vl": run_qwen_vl,
+    "qwen2_vl": run_qwen2_vl,
 }


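The prompt hard-coded in `run_qwen2_vl` is a ChatML-style string with one `<|vision_start|><|image_pad|><|vision_end|>` block placed before the question. A minimal sketch of the same string built by a helper may make the layout easier to see. `build_qwen2_vl_prompt` and its `num_images` parameter are hypothetical names used only for illustration, and repeating the placeholder block once per image is an assumption rather than something this PR states.

```python
# Hypothetical helper, not part of vLLM or this PR. It rebuilds the prompt
# string used in run_qwen2_vl and (as an assumption) repeats the image
# placeholder block once per image.
IMAGE_PLACEHOLDER = "<|vision_start|><|image_pad|><|vision_end|>"


def build_qwen2_vl_prompt(question: str, num_images: int = 1) -> str:
    """Return a ChatML-style prompt with one placeholder block per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    placeholders = IMAGE_PLACEHOLDER * num_images
    return ("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
            f"<|im_start|>user\n{placeholders}{question}<|im_end|>\n"
            "<|im_start|>assistant\n")
```

With `num_images=1` the helper returns exactly the string assembled in `run_qwen2_vl` for the same question.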
68 changes: 61 additions & 7 deletions examples/offline_inference_vision_language_multi_image.py
@@ -6,7 +6,7 @@
 from argparse import Namespace
 from typing import List

-from transformers import AutoTokenizer
+from transformers import AutoProcessor, AutoTokenizer

 from vllm import LLM, SamplingParams
 from vllm.multimodal.utils import fetch_image
@@ -30,7 +30,7 @@ def load_phi3v(question, image_urls: List[str]):
                              for i, _ in enumerate(image_urls, start=1))
     prompt = f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"
     stop_token_ids = None
-    return llm, prompt, stop_token_ids
+    return llm, prompt, stop_token_ids, None


 def load_internvl(question, image_urls: List[str]):
@@ -60,18 +60,72 @@ def load_internvl(question, image_urls: List[str]):
     # https://huggingface.co/OpenGVLab/InternVL2-2B#service
     stop_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>", "<|end|>"]
     stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]
-    return llm, prompt, stop_token_ids
+
+    return llm, prompt, stop_token_ids, None


+def load_qwen2_vl(question, image_urls: List[str]):
+    try:
+        from qwen_vl_utils import process_vision_info
+    except ModuleNotFoundError:
+        print('WARNING: `qwen-vl-utils` not installed, input images will not '
+              'be automatically resized. You can enable this functionality by '
+              '`pip install qwen-vl-utils`.')
+        process_vision_info = None
+
+    model_name = "Qwen/Qwen2-VL-7B-Instruct"
+
+    llm = LLM(
+        model=model_name,
+        max_num_seqs=5,
+        max_model_len=4096,
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )
+
+    placeholders = [{"type": "image", "image": url} for url in image_urls]
+    messages = [{
+        "role": "system",
+        "content": "You are a helpful assistant."
+    }, {
+        "role":
+        "user",
+        "content": [
+            *placeholders,
+            {
+                "type": "text",
+                "text": question
+            },
+        ],
+    }]
+
+    processor = AutoProcessor.from_pretrained(model_name)
+
+    prompt = processor.apply_chat_template(messages,
+                                           tokenize=False,
+                                           add_generation_prompt=True)
+
+    stop_token_ids = None
+
+    if process_vision_info is None:
+        image_data = [fetch_image(url) for url in image_urls]
+    else:
+        image_data, _ = process_vision_info(messages)
+
+    return llm, prompt, stop_token_ids, image_data
+
+
 model_example_map = {
     "phi3_v": load_phi3v,
     "internvl_chat": load_internvl,
+    "qwen2_vl": load_qwen2_vl,
 }


 def run_generate(model, question: str, image_urls: List[str]):
-    llm, prompt, stop_token_ids = model_example_map[model](question,
-                                                           image_urls)
+    llm, prompt, stop_token_ids, image_data = model_example_map[model](
+        question, image_urls)
+    if image_data is None:
+        image_data = [fetch_image(url) for url in image_urls]

     sampling_params = SamplingParams(temperature=0.0,
                                      max_tokens=128,
@@ -81,7 +135,7 @@ def run_generate(model, question: str, image_urls: List[str]):
         {
             "prompt": prompt,
             "multi_modal_data": {
-                "image": [fetch_image(url) for url in image_urls]
+                "image": image_data
             },
         },
         sampling_params=sampling_params)
@@ -92,7 +146,7 @@ def run_generate(model, question: str, image_urls: List[str]):


 def run_chat(model: str, question: str, image_urls: List[str]):
-    llm, _, stop_token_ids = model_example_map[model](question, image_urls)
+    llm, _, stop_token_ids, _ = model_example_map[model](question, image_urls)

     sampling_params = SamplingParams(temperature=0.0,
                                      max_tokens=128,
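The change to this example widens every loader's return value to a 4-tuple whose last element is optional image data: loaders that do no preprocessing return `None`, and `run_generate` then fetches the images itself. A minimal sketch of that contract follows. Every name in it (`fetch_stub`, `load_plain`, `load_preprocessed`, `LOADERS`, `resolve_images`) is a hypothetical stand-in, strings take the place of real images, and no model is loaded.

```python
from typing import Callable, Dict, List, Optional, Tuple

# All names below are hypothetical stand-ins, not vLLM APIs. A loader
# returns (llm, prompt, stop_token_ids, image_data); image_data is None
# when the loader did no preprocessing of its own.
LoaderResult = Tuple[object, str, Optional[List[int]], Optional[List[str]]]


def fetch_stub(url: str) -> str:
    """Stand-in for fetch_image: tags the URL instead of downloading it."""
    return f"fetched:{url}"


def load_plain(question: str, urls: List[str]) -> LoaderResult:
    # No model-specific preprocessing, so the caller fetches the images.
    return None, question, None, None


def load_preprocessed(question: str, urls: List[str]) -> LoaderResult:
    # The loader prepares the images itself (for example by resizing them)
    # and hands them back as the fourth tuple element.
    return None, question, None, [f"resized:{u}" for u in urls]


LOADERS: Dict[str, Callable[[str, List[str]], LoaderResult]] = {
    "plain": load_plain,
    "preprocessed": load_preprocessed,
}


def resolve_images(model: str, question: str, urls: List[str]) -> List[str]:
    """Mirror run_generate: use the loader's images, else fetch them."""
    _llm, _prompt, _stop, image_data = LOADERS[model](question, urls)
    if image_data is None:
        image_data = [fetch_stub(u) for u in urls]
    return image_data
```

Returning `None` instead of fetching inside every loader keeps the existing loaders unchanged apart from the extra tuple element, which seems to be the trade-off the diff makes.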
1 change: 1 addition & 0 deletions requirements-common.txt
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,4 @@ importlib_metadata
mistral_common >= 1.3.4
pyyaml
six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12
einops # Required for Qwen2-VL.
9 changes: 6 additions & 3 deletions vllm/config.py
@@ -773,7 +773,7 @@ class LoadConfig:
ignore_patterns: The list of patterns to ignore when loading the model.
Default to "original/**/*" to avoid repeated loading of llama's
checkpoints.

"""

load_format: Union[str, LoadFormat, "BaseModelLoader"] = LoadFormat.AUTO
@@ -1741,8 +1741,11 @@ def _get_and_verify_max_len(
                     "with rope_scaling. Please raise an issue so we can "
                     "investigate.")

-            assert "factor" in rope_scaling
-            scaling_factor = rope_scaling["factor"]
+            if rope_type == "mrope":
+                scaling_factor = 1
+            else:
+                assert "factor" in rope_scaling
+                scaling_factor = rope_scaling["factor"]
             if rope_type == "yarn":
                 derived_max_model_len = rope_scaling[
                     "original_max_position_embeddings"]
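The hunk above stops requiring a `"factor"` key when the rope type is `"mrope"` and uses a scaling factor of 1 instead. The sketch below is a simplified, standalone model of that branch. `derive_max_len` is a hypothetical function, not the vLLM implementation; the final multiplication by the factor and the sample `mrope_section` key are assumptions that do not appear in the hunk.

```python
from typing import Any, Dict, Optional


def derive_max_len(base_max_len: int,
                   rope_scaling: Optional[Dict[str, Any]]) -> int:
    """Hypothetical, simplified model of the branch changed above."""
    if rope_scaling is None:
        return base_max_len
    rope_type = rope_scaling.get("type", rope_scaling.get("rope_type"))
    if rope_type == "mrope":
        # No "factor" key is required for mrope; the length is not stretched.
        scaling_factor = 1
    else:
        assert "factor" in rope_scaling
        scaling_factor = rope_scaling["factor"]
    if rope_type == "yarn":
        # YaRN scales from the original (pre-extension) context length.
        base_max_len = rope_scaling["original_max_position_embeddings"]
    # Assumption: the derived length is the base length times the factor.
    return int(base_max_len * scaling_factor)
```

Under these assumptions, `derive_max_len(32768, {"type": "mrope", "mrope_section": [16, 24, 24]})` leaves the length unchanged, whereas the pre-change code would have failed the `"factor"` assertion on the same input.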
8 changes: 7 additions & 1 deletion vllm/entrypoints/chat_utils.py
@@ -107,7 +107,7 @@ class ConversationMessage(TypedDict, total=False):
     """The tool calls generated by the model, such as function calls."""


-ModalityStr = Literal["image", "audio"]
+ModalityStr = Literal["image", "audio", "video"]
 _T = TypeVar("_T")


@@ -157,12 +157,18 @@ def _placeholder_str(self, modality: ModalityStr,
                                               hf_config.image_token_index)
             if model_type in ("chameleon", "internvl_chat"):
                 return "<image>"
+            if model_type == "qwen2_vl":
+                return "<|vision_start|><|image_pad|><|vision_end|>"

             raise TypeError(f"Unknown model type: {model_type}")
         elif modality == "audio":
             if model_type == "ultravox":
                 return "<|reserved_special_token_0|>"
             raise TypeError(f"Unknown model type: {model_type}")
+        elif modality == "video":
+            if model_type == "qwen2_vl":
+                return "<|vision_start|><|video_pad|><|vision_end|>"
+            raise TypeError(f"Unknown model type: {model_type}")
         else:
             raise TypeError(f"Unknown modality: {modality}")

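`_placeholder_str` now dispatches on both modality and model type, with `"video"` added as a third modality. One way to read the resulting behavior is as a lookup keyed on the pair. The table-driven sketch below is a hypothetical restatement, not the code in this PR, and it covers only the branches visible in the hunk, so image models handled earlier in the function are omitted.

```python
# Hypothetical table-driven restatement; only the entries visible in the
# hunk above are included, so other image models are omitted.
_PLACEHOLDERS = {
    ("image", "chameleon"): "<image>",
    ("image", "internvl_chat"): "<image>",
    ("image", "qwen2_vl"): "<|vision_start|><|image_pad|><|vision_end|>",
    ("audio", "ultravox"): "<|reserved_special_token_0|>",
    ("video", "qwen2_vl"): "<|vision_start|><|video_pad|><|vision_end|>",
}
_MODALITIES = ("image", "audio", "video")


def placeholder_str(modality: str, model_type: str) -> str:
    """Return the placeholder text for a (modality, model type) pair."""
    if modality not in _MODALITIES:
        raise TypeError(f"Unknown modality: {modality}")
    try:
        return _PLACEHOLDERS[(modality, model_type)]
    except KeyError:
        # Same error the if/elif chain raises for an unsupported model.
        raise TypeError(f"Unknown model type: {model_type}") from None
```

As in the diff, an unsupported pair such as `("video", "ultravox")` raises `TypeError` naming the model type, while an unrecognized modality raises `TypeError` naming the modality.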