[Model] Add GLM-4v support and meet vllm==0.6.1.post2+cu123 #8663
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
We are very much looking forward to the author considering merging this PR, as it will make it easier for developers using GLM-4V to work with the model on the vLLM framework.
Overall looks good at a glance. It would be better for this PR to have these things done:
def merge_glm_vision_embeddings(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    vision_embeddings: torch.Tensor,
    boi_token_id: int,
    eoi_token_id: int,
) -> torch.Tensor:
    boi_positions = (input_ids == boi_token_id).nonzero(as_tuple=True)[0]
    eoi_positions = (input_ids == eoi_token_id).nonzero(as_tuple=True)[0]

    mask = torch.zeros_like(input_ids, dtype=torch.bool)

    for boi_pos, eoi_pos in zip(boi_positions, eoi_positions):
        assert boi_pos < eoi_pos
        mask[boi_pos:eoi_pos + 1] = True
    inputs_embeds[mask] = vision_embeddings.view(-1,
                                                 vision_embeddings.shape[-1])
    return inputs_embeds
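A tiny worked example of the masking behaviour, assuming the function above is in scope (the token ids and shapes are hypothetical, chosen only for illustration):

import torch

# Suppose <boi>=100, <eoi>=101, and the image occupies 3 positions in between.
input_ids = torch.tensor([1, 100, 0, 0, 0, 101, 2])
inputs_embeds = torch.zeros(7, 4)
vision_embeddings = torch.arange(5 * 4, dtype=torch.float32).view(5, 4)

out = merge_glm_vision_embeddings(input_ids, inputs_embeds, vision_embeddings,
                                  boi_token_id=100, eoi_token_id=101)
# Positions 1..5 (from <boi> through <eoi>, inclusive) now hold the 5 vision rows.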
Can't we use merge_multimodal_embeddings from vllm/model_executor/models/utils.py here? It seems to do the same thing, just with a different indexing method. I assume the vision part of input_ids in glm is something like [<boi>, <img_token>, ... , <img_token>, <eoi>, ...]; please correct me if that's wrong.
We use the two parameters boi_token_id and eoi_token_id to locate the placeholder_token_id, so the merge_multimodal_embeddings method in vllm/model_executor/models/utils.py does not apply here.
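For context, a minimal sketch of the utils-based approach the reviewer refers to (hedged: the single-placeholder-token signature of merge_multimodal_embeddings is an assumption for this vLLM version); the GLM helper above instead masks the whole <boi> ... <eoi> span:

import torch

def merge_by_placeholder(input_ids: torch.Tensor,
                         inputs_embeds: torch.Tensor,
                         vision_embeddings: torch.Tensor,
                         placeholder_token_id: int) -> torch.Tensor:
    # Roughly the utils.py behaviour: every position holding the placeholder
    # token receives one row of the flattened vision embeddings.
    mask = input_ids == placeholder_token_id
    inputs_embeds[mask] = vision_embeddings.view(-1,
                                                 vision_embeddings.shape[-1])
    return inputs_embeds

Since GLM-4v marks the image span with begin/end tokens rather than repeating a dedicated placeholder token, the PR keeps its own merge helper.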
        kv_caches=kv_caches,
        attn_metadata=attn_metadata,
    )
    return hidden_states


@MULTIMODAL_REGISTRY.register_image_input_mapper()
class ChatGLMForCausalLM(nn.Module, SupportsLoRA):
vLLM expects user image input as PIL.Image rather than tensors directly; the PIL.Image is converted to tensors in the image_input_mapper. You might need to implement an input mapper for GLM-4v: since there is no preprocessor implemented in the model repo, the default image_input_mapper may not work with PIL.Image inputs here.
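A rough sketch of what such a mapper could look like (hedged: the mapper name, the 1120x1120 resize, and the scaling below are illustrative assumptions rather than GLM-4v's actual preprocessing, and the InputContext/MultiModalInputs import locations are assumed for the vLLM version targeted by this PR):

import numpy as np
import torch
from PIL import Image

from vllm.inputs import InputContext
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalInputs


def glmv_image_input_mapper(ctx: InputContext, data: object) -> MultiModalInputs:
    # Hypothetical mapper: turn a PIL.Image into the pixel_values tensor the
    # GLM-4v vision tower expects. Resize/scale values are placeholders.
    if isinstance(data, Image.Image):
        image = data.convert("RGB").resize((1120, 1120))
        arr = np.array(image, dtype=np.float32) / 255.0        # HWC in [0, 1]
        pixel_values = torch.from_numpy(arr).permute(2, 0, 1)  # CHW
        return MultiModalInputs({"pixel_values": pixel_values.unsqueeze(0)})
    raise TypeError(f"Unsupported image input type: {type(data)}")

# Registration on the model class would then look roughly like:
# @MULTIMODAL_REGISTRY.register_image_input_mapper(glmv_image_input_mapper)
# class ChatGLMForCausalLM(...): ...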
elif isinstance(pixel_values, list):
    return torch.concat(pixel_values)
else:
    raise TypeError("""pixel_values must be a torch.Tensor
This will keep the whitespace/new lines in the error. Could you use something like

raise TypeError("pixel_values must be a torch.Tensor "
                "or a list of torch.Tensor")

instead?
vision_config = getattr(hf_config, 'vision_config', None)
if vision_config is None:
    return 1
Should this be 0 if there's no vision config?
If this is set to 0, the error "ValueError: You should set the number of tokens to a positive integer. Found: 0" will be reported.
Ah, I didn't realize there was validation for that in register_max_multimodal_tokens - disregard this comment then, thanks! 🙂
if vision_config is None:
    return llm_inputs
elif isinstance(vision_config, dict):
    image_placeholder_length = (vision_config["image_size"] //
Could you move the placeholder calculation out to a common place since it's used in multiple places?
Ok, I will modify it in the new commit
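One hedged way to factor it out into a shared helper (the name and the exact formula, including the factor-of-2 downsampling, are guesses based on the truncated snippet above):

def calculate_image_placeholder(vision_config: dict) -> int:
    # Hypothetical shared helper: number of image placeholder tokens derived
    # from the vision tower's image and patch sizes.
    return (vision_config["image_size"] // vision_config["patch_size"] // 2)**2

Both the max-multimodal-tokens callback and the input processor could then call this single helper instead of duplicating the expression.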
    else:
        pixel_values = torch.concat(list(pixel_values))
elif isinstance(pixel_values, list):
    return torch.concat(pixel_values)
Is there a reason pixel values get returned directly here if it's a list instead of merging the multimodal embeddings & running the encoder?
is_weight_to_be_merge = False
for _, merged_weight_dict in merged_weights_dict.items():
    if name in merged_weight_dict:
        assert merged_weight_dict[name] is None
Can you switch these assertions to raise exceptions or add messages to them so that it's more clear what is happening if they fail?
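For example, a small sketch of the kind of change being requested (the message wording is illustrative, since the surrounding loading logic is not fully shown here):

# Option 1: keep the assertion but make the failure self-explanatory.
assert merged_weight_dict[name] is None, (
    f"Merged weight '{name}' was assigned more than once during loading.")

# Option 2: raise an explicit exception instead.
if merged_weight_dict[name] is not None:
    raise ValueError(
        f"Merged weight '{name}' was assigned more than once during loading.")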
Ok, I will modify it in the new commit
inputs_embeds = self.embedding(input_ids)
pixel_values = kwargs.pop("image_embeds", None)
I assume kwargs["image_embeds"] are expected to be normalized images, right? If you write a custom mapper, can you map it to pixel_values? It's a bit confusing otherwise, since a lot of the vision models support passing in the visual embeddings directly.
Ok, I will modify it in the new commit
if isinstance(pixel_values, torch.Tensor):
    if pixel_values.ndim == 2:
        pixel_values = pixel_values
Can you please explain when the ndim is expected to be 2 instead of 4 (i.e., (B, C, H, W))?
Also, since this isn't doing anything in this case, can this be changed to something like

if pixel_values.ndim != 2:
    pixel_values = torch.concat(list(pixel_values))

instead?
Ok, I will modify it in the new commit
pixel_values = kwargs.pop("image_embeds", None)
if pixel_values is not None and self.vision is not None:

    if isinstance(pixel_values, torch.Tensor):
        if pixel_values.ndim == 2:
            pixel_values = pixel_values
        else:
            pixel_values = torch.concat(list(pixel_values))
    elif isinstance(pixel_values, list):
        return torch.concat(pixel_values)
    else:
        raise TypeError("""pixel_values must be a torch.Tensor
            or a list of torch.Tensor
            """)

    pixel_values = pixel_values.to(dtype=inputs_embeds.dtype)
Can you move this part to a separate _parse_and_validate_image_input that returns GLMImagePixelInputs, just like other VLMs implemented in vLLM?
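A hedged sketch of the requested structure, mirroring the pattern other vLLM VLMs use (the GLMImagePixelInputs fields and the validation details are assumptions, not this PR's final code):

from typing import Optional, TypedDict

import torch


class GLMImagePixelInputs(TypedDict):
    type: str  # "pixel_values"
    data: torch.Tensor
    """Shape: (num_images, num_channels, height, width)"""


def _parse_and_validate_image_input(
        self, **kwargs: object) -> Optional[GLMImagePixelInputs]:
    # Method on the model class (shown unbound for brevity).
    pixel_values = kwargs.pop("pixel_values", None)
    if pixel_values is None:
        return None
    if isinstance(pixel_values, list):
        pixel_values = torch.concat(pixel_values)
    elif not isinstance(pixel_values, torch.Tensor):
        raise TypeError("pixel_values must be a torch.Tensor "
                        "or a list of torch.Tensor")
    return GLMImagePixelInputs(type="pixel_values", data=pixel_values)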
Ok, I will modify it in the new commit
For 1, I have downloaded the model locally, and then used
@sixsixcoder Is the model path correct? I tested with my local checkpoint on your branch and it can be loaded correctly. Here is a sample test:

# tests/models/decoder_only/vision_language/test_glm4v.py
from typing import List, Optional, Tuple, Type

import pytest

from vllm.multimodal.utils import rescale_image_size
from vllm.utils import is_cpu

from ....conftest import (IMAGE_ASSETS, HfRunner, PromptImageInput, VllmRunner)

HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "stop_sign":
    "<|user|>\n<|image_1|>\nWhat's the content of the image?<|end|>\n<|assistant|>\n",  # noqa: E501
    "cherry_blossom":
    "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n",
})

models = ["/data/LLM-model/glm-4v-9b"]

target_dtype = "half"
if is_cpu():
    target_dtype = "bfloat16"


def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    inputs: List[Tuple[List[str], PromptImageInput]],
    model: str,
    *,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    mm_limit: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    # max_model_len should be greater than image_feature_size
    with vllm_runner(model,
                     max_model_len=4096,
                     max_num_seqs=1,
                     dtype=dtype,
                     limit_mm_per_prompt={"image": mm_limit},
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
                     enforce_eager=True) as vllm_model:
        pass

    with hf_runner(model, dtype=dtype) as hf_model:
        pass


@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
    "size_factors",
    [
        # No image
        [],
    ],
)
@pytest.mark.parametrize("dtype", [target_dtype])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [10])
def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
                dtype: str, max_tokens: int, num_logprobs: int) -> None:
    images = [asset.pil_image for asset in image_assets]
    inputs_per_image = [(
        [prompt for _ in size_factors],
        [rescale_image_size(image, factor) for factor in size_factors],
    ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]
    run_test(
        hf_runner,
        vllm_runner,
        inputs_per_image,
        model,
        dtype=dtype,
        max_tokens=max_tokens,
        num_logprobs=num_logprobs,
        mm_limit=1,
        tensor_parallel_size=1,
    )

Outputs:
# @lru_cache
def cached_get_image_processor(
    processor_name: str,
    *args,
    trust_remote_code: bool = False,
    **kwargs,
):
    """Gets an image processor for the given model name via HuggingFace."""
    # don't put this import at the top level
    # it will call torch.cuda.device_count()
    from transformers import AutoTokenizer

    try:
        processor = AutoTokenizer.from_pretrained(
            processor_name,
            *args,
            trust_remote_code=trust_remote_code,
            **kwargs)
        image_processor = processor.apply_chat_template
    except ValueError as e:
        # If the error pertains to the processor class not existing or not
        # currently being imported, suggest using the --trust-remote-code flag.
        # Unlike AutoTokenizer, AutoImageProcessor does not separate such errors
        if not trust_remote_code:
            err_msg = (
                "Failed to load the image processor. If the image processor is "
                "a custom processor not yet available in the HuggingFace "
                "transformers library, consider setting "
                "`trust_remote_code=True` in LLM or using the "
                "`--trust-remote-code` flag in the CLI.")
            raise RuntimeError(err_msg) from e
        else:
            raise e

    return image_processor
Is this copy and hacking necessary?
Since the default image_processor method of vLLM is not applicable to GLM-4v, I rewrote the image_processor method to ensure that the code works.
Since glm-4v doesn't have an image_processor implemented and uses the tokenizer to process multi-modal data, we can use cached_get_tokenizer in vllm/vllm/multimodal/utils.py (line 18 in cdc72e3):

cached_get_tokenizer = lru_cache(get_tokenizer)

So the processing implementation would look like this:

tokenizer = cached_get_tokenizer(...)
raw_batch_data = tokenizer.apply_chat_template(conversation=..., ...)
Thanks for your reply, I have modified it in commit 2c89931
try:
    raw_batch_data = image_processor(conversation=[{
        "role":
        "user",
        "image":
        llm_inputs['multi_modal_data']["image"],
        "content":
        llm_inputs['prompt']
    }],
                                     add_generation_prompt=True,
                                     tokenize=True,
                                     return_tensors="pt",
                                     return_dict=True).data
except Exception:
    logger.error("Failed to process content (%s)", llm_inputs['prompt'])
    raise
Is this try ... except ... statement just for debugging? I think we should raise an explicit error if the user inputs an image/prompt unsupported by the image_processor.
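One hedged way to surface the failure explicitly rather than wrapping the whole call in try/except (the check and the message wording below are illustrative):

image = llm_inputs.get("multi_modal_data", {}).get("image")
if image is None:
    raise ValueError(
        "GLM-4v expects image input under multi_modal_data['image'], but the "
        f"prompt {llm_inputs['prompt']!r} was provided without an image.")
# ... then call tokenizer.apply_chat_template(...) directly, without the
# blanket try/except around it.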
Thank you for your review and reply, I can modify the error message here
I have updated the code according to this requirement. Points 2 and 3 have been completed, but I encountered a problem with Point 1. GLM-4v does not have a
@sixsixcoder You can do some hacking, referring to vllm/tests/models/decoder_only/vision_language/test_internvl.py lines 153 to 154 in cdc72e3:

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.transformer.output_layer
Thanks for your reply, I have modified it in commit 2c89931
Just some minor nits :)
tensor_parallel_size=1,
max_model_len=8192,
trust_remote_code=True,
# gpu_memory_utilization=0.5,
Seems that you forgot to remove this commented-out arg used for debugging.
# gpu_memory_utilization=0.9,
enforce_eager=True) as vllm_model:
    # tokenizer = vllm_model.model.get_tokenizer()
Ditto.
max_tokens,
num_logprobs=num_logprobs,
images=images,
# tokenizer=tokenizer
Ditto
Thank you for your reply, I will edit it immediately
@sixsixcoder Can you resolve the conflicts with the main branch and update the branch as well? Thanks!
Overview
This PR supports the glm-4v-9b multimodal model while maintaining compatibility with chatglm. This PR was inspired by, and reuses some code from, #5358.

Changes
- Add vision_config for ChatGLMConfig
- Add the vision encoder in vllm/model_executor/models/glm4_vision_encoder.py
- Add a vision module for ChatGLMModel, making ChatGLMForCausalLM multimodal capable
- Support image_embeds parameters

Development Environment

Usage
- glm-4-9b-chat
- glm-4v-9b (see the example sketch below)
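The usage snippets were collapsed in this view; below is a minimal, hedged sketch of offline inference with glm-4v-9b (the model path, prompt template, and sampling settings are placeholders rather than the exact commands from this PR):

from PIL import Image

from vllm import LLM, SamplingParams

# Assumed checkpoint path; trust_remote_code is needed for the GLM config/tokenizer.
llm = LLM(model="THUDM/glm-4v-9b",
          trust_remote_code=True,
          max_model_len=8192,
          enforce_eager=True)

image = Image.open("example.jpg").convert("RGB")
prompt = "<|user|>\nWhat is shown in this image?<|assistant|>\n"  # placeholder template

outputs = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    },
    SamplingParams(temperature=0.2, max_tokens=128),
)
print(outputs[0].outputs[0].text)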
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality
The PR needs to meet the following code quality standards:
- Please use format.sh to format your code.
- Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels
Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.
- Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in Python so that dynamic dims can be handled automatically. See the above documents for a description of meta-functions.
- Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!