
[Model] Add GLM-4v support and meet vllm==0.6.1.post2+cu123 #8663

Closed

Conversation


@sixsixcoder sixsixcoder commented Sep 20, 2024

Overview

This PR adds support for the glm-4v-9b multimodal model while maintaining compatibility with chatglm.
This PR was inspired by, and reuses some code from, #5358.

Changes

  1. Add a vision_config field to ChatGLMConfig (a sketch follows this list).
  2. Add the GLM-4 vision encoder in vllm/model_executor/models/glm4_vision_encoder.py.
  3. Add an optional vision module to ChatGLMModel, making ChatGLMForCausalLM multimodal-capable.
  4. Add support for receiving and processing image_embeds parameters.
  5. Add the weight-loading logic.
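
A minimal sketch of item 1 (an assumption about the shape of the change, not the exact diff; the surrounding constructor arguments are illustrative only):

from transformers import PretrainedConfig


class ChatGLMConfig(PretrainedConfig):
    model_type = "chatglm"

    def __init__(self,
                 num_layers=28,
                 hidden_size=4096,
                 padded_vocab_size=65024,
                 vision_config=None,  # absent for text-only glm-4-9b-chat
                 **kwargs):
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.padded_vocab_size = padded_vocab_size
        # glm-4v-9b ships a dict describing its ViT (image_size, patch_size,
        # hidden_size, ...); keeping the field optional preserves chatglm
        # compatibility.
        self.vision_config = vision_config
        super().__init__(**kwargs)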

Development Environment

vllm==0.6.1.post2+cu123
vllm-flash-attn==2.6.1
transformers==4.44.2
torch==2.4.0
torchvision==0.19.0
cuda==12.3
python==3.10

Usage

glm-4-9b-chat

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

max_model_len, tp_size = 8192, 1
model_name = "THUDM/glm-4-9b-chat"

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0, max_tokens=1024, stop_token_ids=stop_token_ids)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Hi!'
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)

input_ids = inputs['input_ids'][0].tolist()

outputs = llm.generate(
    TokensPrompt(**{
        "prompt_token_ids": input_ids,
    }),
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
Hi 👋! I'm ChatGLM, the artificial intelligence assistant, nice to meet you. Feel free to ask me any questions.

glm-4v-9b

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

max_model_len, tp_size = 8192, 1
model_name = "THUDM/glm-4v-9b"

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0, max_tokens=1024, stop_token_ids=stop_token_ids)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Describe this picture.'
image = Image.open("docs/source/assets/logos/vllm-logo-text-light.png").convert('RGB')
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)

image_tensor = inputs['images']

input_ids = inputs['input_ids'][0].tolist()

outputs = llm.generate(
    TokensPrompt(**{
        "prompt_token_ids": input_ids,
        "multi_modal_data":  {"image": image_tensor},
    }),
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
The image shows a logo with the letters "LLM" in uppercase, bold font. The "L" and "M" are in a dark grey or black color, while the "L" also has a slight shadow effect, giving it a three-dimensional appearance. The "L" on the left side of the logo is unique; it is stylized with a blue and a yellow shape that resembles a flag or a small arrow pointing upwards, with the blue shape being the larger and the yellow shape being the smaller, triangular extension on the right side of the blue shape. The background of the logo is a solid, dark color, which contrasts sharply with the lighter colors of the "L" and the blue and yellow shapes.

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zRzRzRzRzRzRzR

We are very much looking forward to this PR being considered for merging, as it will make it easier for developers using GLM-4V to work with the model on the vLLM framework.


Isotr0py commented Sep 20, 2024

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that GLM-4v model can be compatible with OpenAI server.

Comment on lines 41 to 58
def merge_glm_vision_embeddings(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    vision_embeddings: torch.Tensor,
    boi_token_id: int,
    eoi_token_id: int,
) -> torch.Tensor:
    boi_positions = (input_ids == boi_token_id).nonzero(as_tuple=True)[0]
    eoi_positions = (input_ids == eoi_token_id).nonzero(as_tuple=True)[0]

    mask = torch.zeros_like(input_ids, dtype=torch.bool)

    for boi_pos, eoi_pos in zip(boi_positions, eoi_positions):
        assert boi_pos < eoi_pos
        mask[boi_pos:eoi_pos + 1] = True
    inputs_embeds[mask] = vision_embeddings.view(-1,
                                                 vision_embeddings.shape[-1])
    return inputs_embeds
Collaborator

Can't we use merge_multimodal_embeddings from vllm/model_executor/models/utils.py here? It seems to be doing the same thing, just with a different indexing method.

I assume the vision part of input_ids in glm is something like [<boi>, <img_token>, ... , <img_token>, <eoi>, ...], please correct me if it's wrong.

Contributor Author

We use the two parameters boi_token_id and eoi_token_id to locate the image placeholder span, so the merge_multimodal_embeddings method in vllm/model_executor/models/utils.py does not apply here.
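
A toy illustration of that layout (hypothetical token ids, not the real GLM vocabulary): the mask spans <boi> ... <eoi> inclusively, so the vision embeddings must cover the whole span rather than a single placeholder id.

import torch

BOI, IMG, EOI = 100, 101, 102  # hypothetical token ids
input_ids = torch.tensor([1, 2, BOI, IMG, IMG, EOI, 3])

mask = torch.zeros_like(input_ids, dtype=torch.bool)
boi_positions = (input_ids == BOI).nonzero(as_tuple=True)[0]
eoi_positions = (input_ids == EOI).nonzero(as_tuple=True)[0]
for boi_pos, eoi_pos in zip(boi_positions, eoi_positions):
    mask[boi_pos:eoi_pos + 1] = True
print(mask)  # tensor([False, False,  True,  True,  True,  True, False])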

        kv_caches=kv_caches,
        attn_metadata=attn_metadata,
    )
    return hidden_states


class ChatGLMForCausalLM(nn.Module, SupportsLoRA):
@MULTIMODAL_REGISTRY.register_image_input_mapper()
Collaborator

vLLM expects user-supplied image data as PIL.Image rather than tensors; the PIL.Image is then converted to tensors by the image input mapper.

You might need to implement an input mapper for GLM-4v: since there seems to be no preprocessor implemented in the model repo, the default image input mapper may not work with PIL.Image inputs here.
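
For reference, a minimal sketch of such a mapper, assuming the vLLM 0.6.x registry API (InputContext, MultiModalInputs); the resize size and the CLIP-style normalization constants are assumptions, not the PR's final preprocessing:

import torch
from PIL import Image
from torchvision import transforms

from vllm.inputs import InputContext
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalInputs

# Assumed preprocessing: resize to the ViT input size and normalize.
_GLM4V_TRANSFORM = transforms.Compose([
    transforms.Resize((1120, 1120)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])


def mm_input_mapper_for_glmv(ctx: InputContext,
                             data: object) -> MultiModalInputs:
    if isinstance(data, Image.Image):
        pixel_values = _GLM4V_TRANSFORM(data.convert("RGB")).unsqueeze(0)
    elif isinstance(data, torch.Tensor):
        pixel_values = data
    else:
        raise TypeError(f"Unsupported image input type: {type(data)}")
    return MultiModalInputs({"pixel_values": pixel_values})

# Registration would then be:
#   @MULTIMODAL_REGISTRY.register_image_input_mapper(mm_input_mapper_for_glmv)
#   class ChatGLMForCausalLM(nn.Module, SupportsLoRA):
#       ...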

elif isinstance(pixel_values, list):
return torch.concat(pixel_values)
else:
raise TypeError("""pixel_values must be a torch.Tensor
Contributor

This will keep the whitespace/new lines in the error. Could you use something like

raise TypeError("pixel_values must be a torch.Tensor "
     "or a list of torch.Tensor")

instead?


vision_config = getattr(hf_config, 'vision_config', None)
if vision_config is None:
    return 1
Contributor

Should this be 0 if there's no vision config?

Contributor Author

If this returns 0, the error "ValueError: You should set the number of tokens to a positive integer. Found: 0" is raised.

Contributor

Ah, I didn't realize there was validation for that in register_max_multimodal_tokens - disregard this comment then, thanks! 🙂

if vision_config is None:
    return llm_inputs
elif isinstance(vision_config, dict):
    image_placeholder_length = (vision_config["image_size"] //
Contributor

Could you move the placeholder calculation out to a common place since it's used in multiple places?

Contributor Author

Ok, I will modify it in the new commit
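
A shared helper could look like the sketch below (an assumption about the eventual refactor, not the committed code); it just factors out the grid computation implied by the snippet above, leaving it to the caller whether the <boi>/<eoi> tokens are counted on top:

def calculate_image_placeholder(vision_config: dict) -> int:
    # ViT patch grid after the 2x spatial downsample assumed for the glm-4v
    # vision tower, squared to get placeholder positions per image.
    grid = vision_config["image_size"] // vision_config["patch_size"] // 2
    return grid * grid

# For glm-4v-9b (assumed image_size=1120, patch_size=14) this yields
# 40 * 40 = 1600 placeholder positions per image.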

    else:
        pixel_values = torch.concat(list(pixel_values))
elif isinstance(pixel_values, list):
    return torch.concat(pixel_values)
Contributor

Is there a reason pixel values get returned directly here if it's a list instead of merging the multimodal embeddings & running the encoder?

is_weight_to_be_merge = False
for _, merged_weight_dict in merged_weights_dict.items():
    if name in merged_weight_dict:
        assert merged_weight_dict[name] is None
Contributor

Can you switch these assertions to raise exceptions or add messages to them so that it's more clear what is happening if they fail?
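
For example (only a sketch of this suggestion; the message wording is illustrative, not the final text):

if merged_weight_dict[name] is not None:
    raise ValueError(
        f"Weight {name!r} is expected to be merged later, but a shard "
        "has already been assigned to it.")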

Contributor Author

Ok, I will modify it in the new commit

inputs_embeds = self.embedding(input_ids)
pixel_values = kwargs.pop("image_embeds", None)
Contributor

I assume kwargs["image_embeds"] are expected to be normalized images, right? If you write a custom mapper, can you map it to pixel_values? It's a bit confusing otherwise since a lot of the vision models support passing in the visual embeddings directly

Contributor Author

Ok, I will modify it in the new commit


if isinstance(pixel_values, torch.Tensor):
    if pixel_values.ndim == 2:
        pixel_values = pixel_values
Contributor

Can you please explain when the ndim is expected to be 2 instead of 4 (i.e., (B, C, H, W))?

Also, since this isn't doing anything in this case, can this be changed to something like

if pixel_values.ndim != 2:
    pixel_values = torch.concat(list(pixel_values))

instead?

Contributor Author

Ok, I will modify it in the new commit

Comment on lines 463 to 478
pixel_values = kwargs.pop("image_embeds", None)
if pixel_values is not None and self.vision is not None:

    if isinstance(pixel_values, torch.Tensor):
        if pixel_values.ndim == 2:
            pixel_values = pixel_values
        else:
            pixel_values = torch.concat(list(pixel_values))
    elif isinstance(pixel_values, list):
        return torch.concat(pixel_values)
    else:
        raise TypeError("""pixel_values must be a torch.Tensor
            or a list of torch.Tensor
            """)

    pixel_values = pixel_values.to(dtype=inputs_embeds.dtype)
Collaborator

Can you move this part to a separated _parse_and_validate_image_input that returns GLMImagePixelInputs just like other VLMs implemented in vllm?
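
Something along these lines, following the pattern other VLMs in vLLM use (GLMImagePixelInputs, the expected shape, and the exact validation are assumptions here, not the final implementation):

from typing import Optional

import torch
from typing_extensions import TypedDict


class GLMImagePixelInputs(TypedDict):
    pixel_values: torch.Tensor
    """Assumed shape: (num_patches, feature_dim) or (batch, channels, h, w)."""


def _parse_and_validate_image_input(
        self, **kwargs: object) -> Optional[GLMImagePixelInputs]:
    # Would live on the model class; shown standalone for brevity.
    pixel_values = kwargs.pop("pixel_values", None)
    if pixel_values is None:
        return None
    if isinstance(pixel_values, list):
        pixel_values = torch.concat(pixel_values)
    if not isinstance(pixel_values, torch.Tensor):
        raise TypeError("pixel_values must be a torch.Tensor or a list of "
                        f"torch.Tensor, got {type(pixel_values)}")
    return GLMImagePixelInputs(pixel_values=pixel_values)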

Contributor Author

Ok, I will modify it in the new commit

@sixsixcoder
Contributor Author

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that the GLM-4v model can be compatible with the OpenAI server.

For point 1, I downloaded the model locally and then used HfRunner and VllmRunner to load it from the local path, but this error is reported: "huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'." How should I deal with it?


Isotr0py commented Sep 25, 2024

@sixsixcoder Is the model path correct? I tested with my local checkpoint on your branch and it can be loaded correctly.

Here is a sample test:

# tests/models/decoder_only/vision_language/test_glm4v.py
from typing import List, Optional, Tuple, Type

import pytest

from vllm.multimodal.utils import rescale_image_size
from vllm.utils import is_cpu

from ....conftest import (IMAGE_ASSETS, HfRunner, PromptImageInput, VllmRunner)

HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "stop_sign":
    "<|user|>\n<|image_1|>\nWhat's the content of the image?<|end|>\n<|assistant|>\n",  # noqa: E501
    "cherry_blossom":
    "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n",
})

models = ["/data/LLM-model/glm-4v-9b"]


target_dtype = "half"
if is_cpu():
    target_dtype = "bfloat16"


def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    inputs: List[Tuple[List[str], PromptImageInput]],
    model: str,
    *,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    mm_limit: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    # max_model_len should be greater than image_feature_size
    with vllm_runner(model,
                     max_model_len=4096,
                     max_num_seqs=1,
                     dtype=dtype,
                     limit_mm_per_prompt={"image": mm_limit},
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
                     enforce_eager=True) as vllm_model:
        pass

    with hf_runner(model, dtype=dtype) as hf_model:
        pass


@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
    "size_factors",
    [
        # No image
        [],
    ],
)
@pytest.mark.parametrize("dtype", [target_dtype])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [10])
def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
                dtype: str, max_tokens: int, num_logprobs: int) -> None:
    images = [asset.pil_image for asset in image_assets]

    inputs_per_image = [(
        [prompt for _ in size_factors],
        [rescale_image_size(image, factor) for factor in size_factors],
    ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]

    run_test(
        hf_runner,
        vllm_runner,
        inputs_per_image,
        model,
        dtype=dtype,
        max_tokens=max_tokens,
        num_logprobs=num_logprobs,
        mm_limit=1,
        tensor_parallel_size=1,
    )

Outputs:

$ pytest -s -v tests/models/decoder_only//vision_language/test_glm4v.py
=================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /home/c4rbon/miniconda3/envs/vllm/bin/python
cachedir: .pytest_cache
rootdir: /home/c4rbon/github-repos/vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, asyncio-0.23.7, buildkite-test-collector-0.1.8, shard-0.1.2, anyio-4.4.0, forked-1.6.0
asyncio: mode=strict
collected 1 item                                                                                                                                                                                                          
Running 1 items in this shard: tests/models/decoder_only/vision_language/test_glm4v.py::test_models[10-128-bfloat16-size_factors0-/data/LLM-model/glm-4v-9b]

tests/models/decoder_only/vision_language/test_glm4v.py::test_models[10-128-bfloat16-size_factors0-/data/LLM-model/glm-4v-9b] WARNING 09-25 12:48:46 config.py:348] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-25 12:48:46 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='/data/LLM-model/glm-4v-9b', speculative_config=None, tokenizer='/data/LLM-model/glm-4v-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/LLM-model/glm-4v-9b, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-25 12:48:46 tokenizer.py:157] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 09-25 12:48:46 cpu_executor.py:354] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 09-25 12:48:46 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-25 12:48:46 selector.py:128] Using Torch SDPA backend.
INFO 09-25 12:48:46 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-25 12:48:46 selector.py:128] Using Torch SDPA backend.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:04<01:07,  4.82s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:23<02:51, 13.18s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:48<03:43, 18.61s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [01:08<03:27, 18.83s/it]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [01:26<03:07, 18.78s/it]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [01:47<02:55, 19.51s/it]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [02:08<02:40, 20.09s/it]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [02:27<02:17, 19.63s/it]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [02:45<01:53, 18.94s/it]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [03:06<01:38, 19.73s/it]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [03:25<01:18, 19.55s/it]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [03:48<01:01, 20.46s/it]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [04:08<00:40, 20.47s/it]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [04:14<00:15, 15.96s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [04:35<00:00, 17.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [04:35<00:00, 18.37s/it]

INFO 09-25 12:53:25 cpu_executor.py:212] # CPU blocks: 6553
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:07<00:00,  2.10it/s]
PASSED

============================================================================================== 1 passed in 294.90s (0:04:54) ==============================================================================================

Comment on lines 48 to 82
# @lru_cache
def cached_get_image_processor(
    processor_name: str,
    *args,
    trust_remote_code: bool = False,
    **kwargs,
):
    """Gets an image processor for the given model name via HuggingFace."""
    # don't put this import at the top level
    # it will call torch.cuda.device_count()
    from transformers import AutoTokenizer

    try:
        processor = AutoTokenizer.from_pretrained(
            processor_name,
            *args,
            trust_remote_code=trust_remote_code,
            **kwargs)
        image_processor = processor.apply_chat_template
    except ValueError as e:
        # If the error pertains to the processor class not existing or not
        # currently being imported, suggest using the --trust-remote-code flag.
        # Unlike AutoTokenizer, AutoImageProcessor does not separate such errors
        if not trust_remote_code:
            err_msg = (
                "Failed to load the image processor. If the image processor is "
                "a custom processor not yet available in the HuggingFace "
                "transformers library, consider setting "
                "`trust_remote_code=True` in LLM or using the "
                "`--trust-remote-code` flag in the CLI.")
            raise RuntimeError(err_msg) from e
        else:
            raise e

    return image_processor
Collaborator

Is this copying and hacking necessary?

Contributor Author

Since vLLM's default image_processor method is not applicable to GLM-4v, I rewrote the image_processor method to make the code work.

Collaborator

@Isotr0py Isotr0py Oct 9, 2024

Since glm-4v doesn't have an image_processor implemented and uses its tokenizer to process multi-modal data, we can use cached_get_tokenizer in vllm/vllm/multimodal/utils.py:

cached_get_tokenizer = lru_cache(get_tokenizer)

So the processing implementation would look like this:

tokenizer = cached_get_tokenizer(...)
raw_batch_data = tokenizer.apply_chat_template(conversation=..., ...)
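
A slightly fuller sketch of that flow (ctx stands for the InputContext handed to the input processor; the conversation payload mirrors the snippet under review below and is otherwise an assumption, not the committed code):

from vllm.multimodal.utils import cached_get_tokenizer

tokenizer = cached_get_tokenizer(
    ctx.model_config.tokenizer,
    trust_remote_code=ctx.model_config.trust_remote_code,
)
# glm-4v's tokenizer both tokenizes the prompt and preprocesses the image
# while applying the chat template, returning input_ids plus an `images`
# tensor in a single BatchEncoding.
raw_batch_data = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "image": llm_inputs["multi_modal_data"]["image"],
        "content": llm_inputs["prompt"],
    }],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).data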

Contributor Author

Thanks for your reply, I have modified it in commit 2c89931

Comment on lines 203 to 218
try:
    raw_batch_data = image_processor(conversation=[{
        "role":
        "user",
        "image":
        llm_inputs['multi_modal_data']["image"],
        "content":
        llm_inputs['prompt']
    }],
                                     add_generation_prompt=True,
                                     tokenize=True,
                                     return_tensors="pt",
                                     return_dict=True).data
except Exception:
    logger.error("Failed to process content (%s)", llm_inputs['prompt'])
    raise
Collaborator

@Isotr0py Isotr0py Oct 9, 2024

Is this try ... except ... statement just for debugging? I think we should raise an explicit error if the user provides an image/prompt that the image_processor does not support.

Contributor Author

Thank you for your review and reply, I can modify the error message here

@sixsixcoder
Contributor Author

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that the GLM-4v model can be compatible with the OpenAI server.

I have updated the code according to these requirements. Points 2 and 3 are complete, but I ran into a problem with point 1: GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py. Do you have any solution?


Isotr0py commented Oct 9, 2024

GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py.

@sixsixcoder You can do some hacking, referring to test_internvl.py:

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.language_model.get_output_embeddings()

Since GLM-4v doesn't have a get_output_embeddings method in the LLM backbone either, the wrapped method might look like this (I assume the lm_head in GLM is named output_layer):

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.transformer.output_layer

@sixsixcoder
Copy link
Contributor Author

GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py.

@sixsixcoder You can do some hacking, referring to test_internvl.py:

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.language_model.get_output_embeddings()

Since GLM-4v doesn't have a get_output_embeddings method in the LLM backbone either, the wrapped method might look like this (I assume the lm_head in GLM is named output_layer):

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.transformer.output_layer

Thanks for your reply, I have modified it in commit 2c89931

Collaborator

@Isotr0py Isotr0py left a comment

Just some minor nits :)

tensor_parallel_size=1,
max_model_len=8192,
trust_remote_code=True,
# gpu_memory_utilization=0.5,
Collaborator

Seems that you forgot to remove this commented-out arg used for debugging.

Comment on lines 41 to 43
# gpu_memory_utilization=0.9,
enforce_eager=True) as vllm_model:
# tokenizer = vllm_model.model.get_tokenizer()
Collaborator

Ditto.

max_tokens,
num_logprobs=num_logprobs,
images=images,
# tokenizer=tokenizer
Collaborator

Ditto

Contributor Author

Thank you for your reply, I will edit it immediately


Isotr0py commented Oct 10, 2024

@sixsixcoder Can you solve the conflicts with the main branch and update the branch as well? Thanks!
