fix llava/llava next issue when working with AutoProcessor #1674

Open. Wants to merge 1 commit into base: main.
Conversation

sywangyi (Collaborator) commented:

The model crashes when running the following script:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()


model_id = "llava-hf/llava-1.5-7b-hf"
#model_id = "llava-hf/llama3-llava-next-8b-hf"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
model = model.to("hpu")
model.generation_config.pad_token_id = model.generation_config.eos_token_id
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

# Define a chat history and use `apply_chat_template` to get a correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What are these?"},
            {"type": "image"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(images=raw_image, text=prompt, return_tensors='pt')
for t in inputs:
    if torch.is_tensor(inputs[t]):
        inputs[t] = inputs[t].to("hpu")
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
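
Note (an annotation, not part of the original report): the failures below suggest that the AutoProcessor has already expanded the single <image> placeholder into one token per image patch, while the Gaudi model path still assumes legacy processing (one placeholder per image) and tries to expand again. A minimal sketch for checking which case a batch falls into, assuming the standard transformers config fields image_token_index and image_seq_length:

import torch

def count_image_tokens(input_ids: torch.Tensor, image_token_index: int) -> int:
    # Maximum number of image placeholder tokens in any sequence of the batch.
    return int((input_ids == image_token_index).sum(dim=1).max())

# Hypothetical usage with `inputs` and `model` from the script above:
# n = count_image_tokens(inputs["input_ids"], model.config.image_token_index)
# legacy = n < model.config.image_seq_length  # True -> processor did not expand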

@sywangyi sywangyi requested a review from regisss as a code owner December 30, 2024 08:10
sywangyi (Collaborator, Author) commented:

For llava:
Traceback (most recent call last):
  File "/workspace/wangyi/optimum-habana/test.py", line 42, in <module>
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1468, in generate
    result = self._sample(
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2449, in _sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/models/llava/modeling_llava.py", line 183, in forward
    outputs = self.language_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
    return inner()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/models/llama/modeling_llama.py", line 1371, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1847, in _call_impl
    return inner()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/models/llama/modeling_llama.py", line 1226, in forward
    htcore.mark_step()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/utils/internal.py", line 36, in lazy_wrapper
    func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/core/step_closure.py", line 71, in mark_step
    htcore._mark_step(device_str, sync)
RuntimeError: The expanded size of the tensor (331776) must match the existing size (576) at non-singleton dimension 0. Target sizes: [331776, 4096]. Tensor sizes: [576, 4096]
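
Note (my reading of those sizes, not stated in the thread): llava-1.5 uses a CLIP ViT-L/14 vision tower at 336px, which yields (336 / 14)^2 = 576 patch features per image. If the processor has already expanded the prompt to 576 <image> tokens and the legacy merge expands each of them by 576 again, the target becomes 576 * 576 = 331776 rows against only 576 available feature rows, matching the sizes in the error:

# Quick sanity check of the arithmetic above:
assert (336 // 14) ** 2 == 576
assert 576 * 576 == 331776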

sywangyi (Collaborator, Author) commented:

For llava_next:
Traceback (most recent call last):
  File "/workspace/wangyi/optimum-habana/test.py", line 43, in <module>
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/generation/utils.py", line 1468, in generate
    result = self._sample(
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/generation/utils.py", line 2440, in _sample
    model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/models/llava_next/modeling_llava_next.py", line 342, in prepare_inputs_for_generation
    self._merge_input_ids_with_image_features(
  File "/workspace/wangyi/optimum-habana/optimum/habana/transformers/models/llava_next/modeling_llava_next.py", line 213, in _merge_input_ids_with_image_features
    raise ValueError(
ValueError: The input provided to the model are wrong. The number of image tokens is 2340 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
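
Note (same pattern, as I read it): here the check fails before any merge happens, because the processor has already expanded the placeholder into 2340 patch tokens while the legacy merge still expects exactly one <image> token per image. A hedged paraphrase of the failing invariant (not the actual source):

# num_image_tokens = (input_ids == config.image_token_index).sum()
# if num_image_tokens != num_images:  # 2340 != 1 in this run
#     raise ValueError("The input provided to the model are wrong. ...")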

HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

sywangyi (Collaborator, Author) commented:

@lkk12014402 @yuanwu2017, please help review.

legacy_processing = (
    (input_ids == self.config.image_token_index).sum(1).max() < self.config.image_seq_length
) or ((input_ids.shape[-1] == 1 if token_idx is None else token_idx == 1) and pixel_values is not None)
if token_idx is not None and pixel_values is not None and legacy_processing:
A contributor commented on this hunk:
Does it affect static shapes optimization?
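
How I read the condition in that hunk (an annotation; the config field names mirror the diff, but the helper itself is hypothetical):

def is_legacy_processing(input_ids, pixel_values, token_idx, config):
    # Fewer <image> tokens than image_seq_length: the processor did not
    # pre-expand the placeholder, so the model must merge image features itself.
    not_expanded = (
        (input_ids == config.image_token_index).sum(1).max() < config.image_seq_length
    )
    # A single-token step (token_idx == 1, or one input id when token_idx is
    # unset) that still carries pixel_values is also routed to the legacy path.
    single_step_with_pixels = (
        input_ids.shape[-1] == 1 if token_idx is None else token_idx == 1
    ) and pixel_values is not None
    return not_expanded or single_step_with_pixels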
