
[Model] Add GLM-4v support and meet vllm==0.6.1.post2+cu123 #8663

Closed

Conversation


@sixsixcoder sixsixcoder commented Sep 20, 2024

Overview

This PR adds support for the glm-4v-9b multimodal model while maintaining compatibility with chatglm.
This PR was inspired by, and reuses some code from, #5358.

Changes

  1. Add a vision_config field to ChatGLMConfig (a sketch follows this list).
  2. Add the GLM-4 vision encoder in vllm/model_executor/models/glm4_vision_encoder.py.
  3. Add an optional vision module to ChatGLMModel, making ChatGLMForCausalLM multimodal-capable.
  4. Add support for receiving and processing image_embeds parameters.
  5. Add the weight-loading logic.
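
A minimal sketch of item 1 (an assumption about the shape of the change, not the exact diff; the surrounding constructor arguments are illustrative only):

from transformers import PretrainedConfig


class ChatGLMConfig(PretrainedConfig):
    model_type = "chatglm"

    def __init__(self,
                 num_layers=28,
                 hidden_size=4096,
                 padded_vocab_size=65024,
                 vision_config=None,  # absent for text-only glm-4-9b-chat
                 **kwargs):
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.padded_vocab_size = padded_vocab_size
        # glm-4v-9b ships a dict describing its ViT (image_size, patch_size,
        # hidden_size, ...); keeping the field optional preserves chatglm
        # compatibility.
        self.vision_config = vision_config
        super().__init__(**kwargs)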

Development Environment

vllm==0.6.1.post2+cu123
vllm-flash-attn==2.6.1
transformers==4.44.2
torch==2.4.0
torchvision==0.19.0
cuda==12.3
python==3.10

Usage

glm-4-9b-chat

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

max_model_len, tp_size = 8192, 1
model_name = "THUDM/glm-4-9b-chat"

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0, max_tokens=1024, stop_token_ids=stop_token_ids)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Hi!'
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)

input_ids = inputs['input_ids'][0].tolist()

outputs = llm.generate(
    TokensPrompt(**{
        "prompt_token_ids": input_ids,
    }),
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
Hi 👋! I'm ChatGLM, the artificial intelligence assistant, nice to meet you. Feel free to ask me any questions.

glm-4v-9b

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

max_model_len, tp_size = 8192, 1
model_name = "THUDM/glm-4v-9b"

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0, max_tokens=1024, stop_token_ids=stop_token_ids)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

query = 'Describe this picture.'
image = Image.open("docs/source/assets/logos/vllm-logo-text-light.png").convert('RGB')
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
)

image_tensor = inputs['images']

input_ids = inputs['input_ids'][0].tolist()

outputs = llm.generate(
    TokensPrompt(**{
        "prompt_token_ids": input_ids,
        "multi_modal_data":  {"image": image_tensor},
    }),
    sampling_params=sampling_params
)

print(outputs[0].outputs[0].text)
The image shows a logo with the letters "LLM" in uppercase, bold font. The "L" and "M" are in a dark grey or black color, while the "L" also has a slight shadow effect, giving it a three-dimensional appearance. The "L" on the left side of the logo is unique; it is stylized with a blue and a yellow shape that resembles a flag or a small arrow pointing upwards, with the blue shape being the larger and the yellow shape being the smaller, triangular extension on the right side of the blue shape. The background of the logo is a solid, dark color, which contrasts sharply with the lighter colors of the "L" and the blue and yellow shapes.

PR Checklist (Click to Expand)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well-documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and utilize the new features or changes.

Adding or changing kernels

Each custom kernel needs a schema and one or more implementations to be registered with PyTorch.

  • Make sure custom ops are registered following PyTorch guidelines: Custom C++ and CUDA Operators and The Custom Operators Manual
  • Custom operations that return Tensors require meta-functions. Meta-functions should be implemented and registered in python so that dynamic dims can be handled automatically. See above documents for a description of meta-functions.
  • Use torch.library.opcheck() to test the function registration and meta-function for any registered ops. See tests/kernels for examples.
  • When changing the C++ signature of an existing op, the schema must be updated to reflect the changes.
  • If a new custom type is needed, see the following document: Custom Class Support in PT2.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@zRzRzRzRzRzRzR

We are very much looking forward to this PR being considered for merging, as it will make it easier for developers using GLM-4V to work with the model on the vLLM framework.


Isotr0py commented Sep 20, 2024

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that GLM-4v model can be compatible with OpenAI server.

Comment on lines 41 to 58
def merge_glm_vision_embeddings(
    input_ids: torch.Tensor,
    inputs_embeds: torch.Tensor,
    vision_embeddings: torch.Tensor,
    boi_token_id: int,
    eoi_token_id: int,
) -> torch.Tensor:
    boi_positions = (input_ids == boi_token_id).nonzero(as_tuple=True)[0]
    eoi_positions = (input_ids == eoi_token_id).nonzero(as_tuple=True)[0]

    mask = torch.zeros_like(input_ids, dtype=torch.bool)

    for boi_pos, eoi_pos in zip(boi_positions, eoi_positions):
        assert boi_pos < eoi_pos
        mask[boi_pos:eoi_pos + 1] = True
    inputs_embeds[mask] = vision_embeddings.view(-1,
                                                 vision_embeddings.shape[-1])
    return inputs_embeds
Collaborator

Can't we use merge_multimodal_embeddings from vllm/model_executor/models/utils.py here? It seems to be doing the same thing, just with a different indexing method.

I assume the vision part of input_ids in glm is something like [<boi>, <img_token>, ... , <img_token>, <eoi>, ...], please correct me if it's wrong.

Contributor Author

We use the two parameters boi_token_id and eoi_token_id to locate the image placeholder span, so the merge_multimodal_embeddings method in vllm/model_executor/models/utils.py does not apply here.
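
A toy illustration of that layout (hypothetical token ids, not the real GLM vocabulary): the mask spans <boi> ... <eoi> inclusively, so the vision embeddings must cover the whole span rather than a single placeholder id.

import torch

BOI, IMG, EOI = 100, 101, 102  # hypothetical token ids
input_ids = torch.tensor([1, 2, BOI, IMG, IMG, EOI, 3])

mask = torch.zeros_like(input_ids, dtype=torch.bool)
boi_positions = (input_ids == BOI).nonzero(as_tuple=True)[0]
eoi_positions = (input_ids == EOI).nonzero(as_tuple=True)[0]
for boi_pos, eoi_pos in zip(boi_positions, eoi_positions):
    mask[boi_pos:eoi_pos + 1] = True
print(mask)  # tensor([False, False,  True,  True,  True,  True, False])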

        kv_caches=kv_caches,
        attn_metadata=attn_metadata,
    )
    return hidden_states


class ChatGLMForCausalLM(nn.Module, SupportsLoRA):
@MULTIMODAL_REGISTRY.register_image_input_mapper()
Collaborator

vLLM expects user-supplied image data as PIL.Image rather than tensors; the PIL.Image is then converted to tensors by the image input mapper.

You might need to implement an input mapper for GLM-4v: since there seems to be no preprocessor implemented in the model repo, the default image input mapper may not work with PIL.Image inputs here.
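
For reference, a minimal sketch of such a mapper, assuming the vLLM 0.6.x registry API (InputContext, MultiModalInputs); the resize size and the CLIP-style normalization constants are assumptions, not the PR's final preprocessing:

import torch
from PIL import Image
from torchvision import transforms

from vllm.inputs import InputContext
from vllm.multimodal import MULTIMODAL_REGISTRY, MultiModalInputs

# Assumed preprocessing: resize to the ViT input size and normalize.
_GLM4V_TRANSFORM = transforms.Compose([
    transforms.Resize((1120, 1120)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])


def mm_input_mapper_for_glmv(ctx: InputContext,
                             data: object) -> MultiModalInputs:
    if isinstance(data, Image.Image):
        pixel_values = _GLM4V_TRANSFORM(data.convert("RGB")).unsqueeze(0)
    elif isinstance(data, torch.Tensor):
        pixel_values = data
    else:
        raise TypeError(f"Unsupported image input type: {type(data)}")
    return MultiModalInputs({"pixel_values": pixel_values})

# Registration would then be:
#   @MULTIMODAL_REGISTRY.register_image_input_mapper(mm_input_mapper_for_glmv)
#   class ChatGLMForCausalLM(nn.Module, SupportsLoRA):
#       ...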

elif isinstance(pixel_values, list):
return torch.concat(pixel_values)
else:
raise TypeError("""pixel_values must be a torch.Tensor
Contributor

This will keep the whitespace/new lines in the error. Could you use something like

raise TypeError("pixel_values must be a torch.Tensor "
     "or a list of torch.Tensor")

instead?


vision_config = getattr(hf_config, 'vision_config', None)
if vision_config is None:
    return 1
Contributor

Should this be 0 if there's no vision config?

Contributor Author

If this returns 0, the error "ValueError: You should set the number of tokens to a positive integer. Found: 0" is raised.

Contributor

Ah, I didn't realize there was validation for that in register_max_multimodal_tokens - disregard this comment then, thanks! 🙂

if vision_config is None:
    return llm_inputs
elif isinstance(vision_config, dict):
    image_placeholder_length = (vision_config["image_size"] //
Contributor

Could you move the placeholder calculation out to a common place since it's used in multiple places?

Contributor Author

Ok, I will modify it in the new commit
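
A shared helper could look like the sketch below (an assumption about the eventual refactor, not the committed code); it just factors out the grid computation implied by the snippet above, leaving it to the caller whether the <boi>/<eoi> tokens are counted on top:

def calculate_image_placeholder(vision_config: dict) -> int:
    # ViT patch grid after the 2x spatial downsample assumed for the glm-4v
    # vision tower, squared to get placeholder positions per image.
    grid = vision_config["image_size"] // vision_config["patch_size"] // 2
    return grid * grid

# For glm-4v-9b (assumed image_size=1120, patch_size=14) this yields
# 40 * 40 = 1600 placeholder positions per image.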

    else:
        pixel_values = torch.concat(list(pixel_values))
elif isinstance(pixel_values, list):
    return torch.concat(pixel_values)
Contributor

Is there a reason pixel values get returned directly here if it's a list instead of merging the multimodal embeddings & running the encoder?

is_weight_to_be_merge = False
for _, merged_weight_dict in merged_weights_dict.items():
    if name in merged_weight_dict:
        assert merged_weight_dict[name] is None
Contributor

Can you switch these assertions to raise exceptions or add messages to them so that it's more clear what is happening if they fail?
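
For example (only a sketch of this suggestion; the message wording is illustrative, not the final text):

if merged_weight_dict[name] is not None:
    raise ValueError(
        f"Weight {name!r} is expected to be merged later, but a shard "
        "has already been assigned to it.")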

Contributor Author

Ok, I will modify it in the new commit

inputs_embeds = self.embedding(input_ids)
pixel_values = kwargs.pop("image_embeds", None)
Contributor

I assume kwargs["image_embeds"] are expected to be normalized images, right? If you write a custom mapper, can you map it to pixel_values? It's a bit confusing otherwise since a lot of the vision models support passing in the visual embeddings directly

Contributor Author

Ok, I will modify it in the new commit


if isinstance(pixel_values, torch.Tensor):
    if pixel_values.ndim == 2:
        pixel_values = pixel_values
Contributor

Can you please explain when the ndim is expected to be 2 instead of 4 (i.e., (B, C, H, W))?

Also, since this isn't doing anything in this case, can this be changed to something like

if pixel_values.ndim != 2:
    pixel_values = torch.concat(list(pixel_values))

instead?

Contributor Author

Ok, I will modify it in the new commit

Comment on lines 463 to 478
pixel_values = kwargs.pop("image_embeds", None)
if pixel_values is not None and self.vision is not None:

    if isinstance(pixel_values, torch.Tensor):
        if pixel_values.ndim == 2:
            pixel_values = pixel_values
        else:
            pixel_values = torch.concat(list(pixel_values))
    elif isinstance(pixel_values, list):
        return torch.concat(pixel_values)
    else:
        raise TypeError("""pixel_values must be a torch.Tensor
            or a list of torch.Tensor
            """)

    pixel_values = pixel_values.to(dtype=inputs_embeds.dtype)
Collaborator

Can you move this part to a separated _parse_and_validate_image_input that returns GLMImagePixelInputs just like other VLMs implemented in vllm?
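
Something along these lines, following the pattern other VLMs in vLLM use (GLMImagePixelInputs, the expected shape, and the exact validation are assumptions here, not the final implementation):

from typing import Optional

import torch
from typing_extensions import TypedDict


class GLMImagePixelInputs(TypedDict):
    pixel_values: torch.Tensor
    """Assumed shape: (num_patches, feature_dim) or (batch, channels, h, w)."""


def _parse_and_validate_image_input(
        self, **kwargs: object) -> Optional[GLMImagePixelInputs]:
    # Would live on the model class; shown standalone for brevity.
    pixel_values = kwargs.pop("pixel_values", None)
    if pixel_values is None:
        return None
    if isinstance(pixel_values, list):
        pixel_values = torch.concat(pixel_values)
    if not isinstance(pixel_values, torch.Tensor):
        raise TypeError("pixel_values must be a torch.Tensor or a list of "
                        f"torch.Tensor, got {type(pixel_values)}")
    return GLMImagePixelInputs(pixel_values=pixel_values)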

Contributor Author

Ok, I will modify it in the new commit

@sixsixcoder
Contributor Author

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that the GLM-4v model can be compatible with the OpenAI server.

For point 1, I downloaded the model locally and then used HfRunner and VllmRunner to load it from the local path, but this error is reported: "huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name'." How should I deal with it?


Isotr0py commented Sep 25, 2024

@sixsixcoder Is the model path correct? I tested with my local checkpoint on your branch and it can be loaded correctly.

Here is a sample test:

# tests/models/decoder_only/vision_language/test_glm4v.py
from typing import List, Optional, Tuple, Type

import pytest

from vllm.multimodal.utils import rescale_image_size
from vllm.utils import is_cpu

from ....conftest import (IMAGE_ASSETS, HfRunner, PromptImageInput, VllmRunner)

HF_IMAGE_PROMPTS = IMAGE_ASSETS.prompts({
    "stop_sign":
    "<|user|>\n<|image_1|>\nWhat's the content of the image?<|end|>\n<|assistant|>\n",  # noqa: E501
    "cherry_blossom":
    "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n",
})

models = ["/data/LLM-model/glm-4v-9b"]


target_dtype = "half"
if is_cpu():
    target_dtype = "bfloat16"


def run_test(
    hf_runner: Type[HfRunner],
    vllm_runner: Type[VllmRunner],
    inputs: List[Tuple[List[str], PromptImageInput]],
    model: str,
    *,
    dtype: str,
    max_tokens: int,
    num_logprobs: int,
    mm_limit: int,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
):
    # max_model_len should be greater than image_feature_size
    with vllm_runner(model,
                     max_model_len=4096,
                     max_num_seqs=1,
                     dtype=dtype,
                     limit_mm_per_prompt={"image": mm_limit},
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
                     enforce_eager=True) as vllm_model:
        pass

    with hf_runner(model, dtype=dtype) as hf_model:
        pass


@pytest.mark.parametrize("model", models)
@pytest.mark.parametrize(
    "size_factors",
    [
        # No image
        [],
    ],
)
@pytest.mark.parametrize("dtype", [target_dtype])
@pytest.mark.parametrize("max_tokens", [128])
@pytest.mark.parametrize("num_logprobs", [10])
def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
                dtype: str, max_tokens: int, num_logprobs: int) -> None:
    images = [asset.pil_image for asset in image_assets]

    inputs_per_image = [(
        [prompt for _ in size_factors],
        [rescale_image_size(image, factor) for factor in size_factors],
    ) for image, prompt in zip(images, HF_IMAGE_PROMPTS)]

    run_test(
        hf_runner,
        vllm_runner,
        inputs_per_image,
        model,
        dtype=dtype,
        max_tokens=max_tokens,
        num_logprobs=num_logprobs,
        mm_limit=1,
        tensor_parallel_size=1,
    )

Outputs:

$ pytest -s -v tests/models/decoder_only//vision_language/test_glm4v.py
=================================================================================================== test session starts ===================================================================================================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /home/c4rbon/miniconda3/envs/vllm/bin/python
cachedir: .pytest_cache
rootdir: /home/c4rbon/github-repos/vllm
configfile: pyproject.toml
plugins: rerunfailures-14.0, asyncio-0.23.7, buildkite-test-collector-0.1.8, shard-0.1.2, anyio-4.4.0, forked-1.6.0
asyncio: mode=strict
collected 1 item                                                                                                                                                                                                          
Running 1 items in this shard: tests/models/decoder_only/vision_language/test_glm4v.py::test_models[10-128-bfloat16-size_factors0-/data/LLM-model/glm-4v-9b]

tests/models/decoder_only/vision_language/test_glm4v.py::test_models[10-128-bfloat16-size_factors0-/data/LLM-model/glm-4v-9b] WARNING 09-25 12:48:46 config.py:348] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
INFO 09-25 12:48:46 llm_engine.py:223] Initializing an LLM engine (v0.6.1.post2) with config: model='/data/LLM-model/glm-4v-9b', speculative_config=None, tokenizer='/data/LLM-model/glm-4v-9b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/LLM-model/glm-4v-9b, use_v2_block_manager=False, num_scheduler_steps=1, enable_prefix_caching=False, use_async_output_proc=False)
WARNING 09-25 12:48:46 tokenizer.py:157] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
WARNING 09-25 12:48:46 cpu_executor.py:354] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 09-25 12:48:46 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-25 12:48:46 selector.py:128] Using Torch SDPA backend.
INFO 09-25 12:48:46 selector.py:183] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 09-25 12:48:46 selector.py:128] Using Torch SDPA backend.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:04<01:07,  4.82s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:23<02:51, 13.18s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:48<03:43, 18.61s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [01:08<03:27, 18.83s/it]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [01:26<03:07, 18.78s/it]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [01:47<02:55, 19.51s/it]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [02:08<02:40, 20.09s/it]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [02:27<02:17, 19.63s/it]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [02:45<01:53, 18.94s/it]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [03:06<01:38, 19.73s/it]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [03:25<01:18, 19.55s/it]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [03:48<01:01, 20.46s/it]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [04:08<00:40, 20.47s/it]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [04:14<00:15, 15.96s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [04:35<00:00, 17.56s/it]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [04:35<00:00, 18.37s/it]

INFO 09-25 12:53:25 cpu_executor.py:212] # CPU blocks: 6553
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:07<00:00,  2.10it/s]
PASSED

============================================================================================== 1 passed in 294.90s (0:04:54) ==============================================================================================

Comment on lines 48 to 82
# @lru_cache
def cached_get_image_processor(
    processor_name: str,
    *args,
    trust_remote_code: bool = False,
    **kwargs,
):
    """Gets an image processor for the given model name via HuggingFace."""
    # don't put this import at the top level
    # it will call torch.cuda.device_count()
    from transformers import AutoTokenizer

    try:
        processor = AutoTokenizer.from_pretrained(
            processor_name,
            *args,
            trust_remote_code=trust_remote_code,
            **kwargs)
        image_processor = processor.apply_chat_template
    except ValueError as e:
        # If the error pertains to the processor class not existing or not
        # currently being imported, suggest using the --trust-remote-code flag.
        # Unlike AutoTokenizer, AutoImageProcessor does not separate such errors
        if not trust_remote_code:
            err_msg = (
                "Failed to load the image processor. If the image processor is "
                "a custom processor not yet available in the HuggingFace "
                "transformers library, consider setting "
                "`trust_remote_code=True` in LLM or using the "
                "`--trust-remote-code` flag in the CLI.")
            raise RuntimeError(err_msg) from e
        else:
            raise e

    return image_processor
Collaborator

Is this copying and hacking necessary?

Contributor Author

Since vLLM's default image_processor method is not applicable to GLM-4v, I rewrote the image_processor method to make the code work.

Collaborator

@Isotr0py Isotr0py Oct 9, 2024

Since glm-4v doesn't have an image_processor implemented and uses its tokenizer to process multi-modal data, we can use cached_get_tokenizer in vllm/vllm/multimodal/utils.py:

cached_get_tokenizer = lru_cache(get_tokenizer)

So the processing implementation would look like this:

tokenizer = cached_get_tokenizer(...)
raw_batch_data = tokenizer.apply_chat_template(conversation=..., ...)
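
A slightly fuller sketch of that flow (ctx stands for the InputContext handed to the input processor; the conversation payload mirrors the snippet under review below and is otherwise an assumption, not the committed code):

from vllm.multimodal.utils import cached_get_tokenizer

tokenizer = cached_get_tokenizer(
    ctx.model_config.tokenizer,
    trust_remote_code=ctx.model_config.trust_remote_code,
)
# glm-4v's tokenizer both tokenizes the prompt and preprocesses the image
# while applying the chat template, returning input_ids plus an `images`
# tensor in a single BatchEncoding.
raw_batch_data = tokenizer.apply_chat_template(
    conversation=[{
        "role": "user",
        "image": llm_inputs["multi_modal_data"]["image"],
        "content": llm_inputs["prompt"],
    }],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).data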

Contributor Author

Thanks for your reply, I have modified it in commit 2c89931

Comment on lines 203 to 218
try:
    raw_batch_data = image_processor(conversation=[{
        "role":
        "user",
        "image":
        llm_inputs['multi_modal_data']["image"],
        "content":
        llm_inputs['prompt']
    }],
                                     add_generation_prompt=True,
                                     tokenize=True,
                                     return_tensors="pt",
                                     return_dict=True).data
except Exception:
    logger.error("Failed to process content (%s)", llm_inputs['prompt'])
    raise
Collaborator

@Isotr0py Isotr0py Oct 9, 2024

Is this try ... except ... statement just for debugging? I think we should raise an explicit error if the user provides an image/prompt that the image_processor does not support.

Contributor Author

Thank you for your review and reply, I can modify the error message here

@sixsixcoder
Contributor Author

Overall this looks good at a glance. The PR would be even better with the following done:

  1. Add a model test for GLM-4v, you can refer to the tests in tests/models/decoder_only/vision_language
  2. Add a GLM-4v example to examples/offline_inference_vision_language.py. If this PR also adds multi-image support, you need to add it to examples/offline_inference_vision_language_multi_image.py as well.
  3. Update the _placeholder_str with GLM-4v's placeholder for BaseMultiModalItemTracker in vllm/entrypoints/chat_utils.py, so that the GLM-4v model can be compatible with the OpenAI server.

I have updated the code according to these requirements. Points 2 and 3 are complete, but I ran into a problem with point 1: GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py. Do you have any solution?


Isotr0py commented Oct 9, 2024

GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py.

@sixsixcoder You can do some hacking, referring to test_internvl.py:

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.language_model.get_output_embeddings()

Since GLM-4v doesn't have a get_output_embeddings method in the LLM backbone either, the wrapped method might look like this (I assume the lm_head in GLM is named output_layer):

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.transformer.output_layer

@sixsixcoder
Copy link
Contributor Author

GLM-4v does not have a get_output_embeddings method, but get_output_embeddings is needed to calculate logprobs in vllm/tests/conftest.py.

@sixsixcoder You can do some hacking, referring to test_internvl.py:

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.language_model.get_output_embeddings()

Since GLM-4v doesn't have a get_output_embeddings method in the LLM backbone either, the wrapped method might look like this (I assume the lm_head in GLM is named output_layer):

hf_model.model.get_output_embeddings = lambda: \
    hf_model.model.transformer.output_layer

Thanks for your reply, I have modified it in commit 2c89931

Collaborator

@Isotr0py Isotr0py left a comment

Just some minor nits :)

tensor_parallel_size=1,
max_model_len=8192,
trust_remote_code=True,
# gpu_memory_utilization=0.5,
Collaborator

Seems that you forgot to remove this commented-out arg used for debugging.

Comment on lines 41 to 43
# gpu_memory_utilization=0.9,
enforce_eager=True) as vllm_model:
# tokenizer = vllm_model.model.get_tokenizer()
Collaborator

Ditto.

max_tokens,
num_logprobs=num_logprobs,
images=images,
# tokenizer=tokenizer
Collaborator

Ditto

Contributor Author

Thank you for your reply, I will edit it immediately


Isotr0py commented Oct 10, 2024

@sixsixcoder Can you solve the conflicts with the main branch and update the branch as well? Thanks!
