Naive question on `IMAGE_TOKEN_INDEX` and Number of Patches #1032

eiespambox · 2024-01-30T18:29:12Z

Given the default settings, I think there are 576 image patches that are processed by the CLIP encoder and the multimodal projector.
This produces a batch x 576 x 1024 tensor as output from the multimodal projector.

However, the code seems to have only one IMAGE_TOKEN_INDEX in the input_ids. I can't find the code that maps all of the patch embeddings onto that single input id representing the image. Can someone help me understand this?

I naively expected 576 IMAGE_TOKEN_INDEX entries to be prepended to the prompt so that the subsequent attention layers can attend to the different patches.

Thanks!

anhskrttt · 2024-05-09T11:12:27Z

I think the image_token_index is used to indicate the position of an image within the tokenized input prompt.

Consider the input prompt in the following example code [source]:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate up to 15 new tokens and decode the first sequence back to text
generate_ids = model.generate(**inputs, max_new_tokens=15)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

In this case, the <image> token is a placeholder that marks the position of the image within the input prompt. When the prompt is tokenized, each token is mapped to an integer id, and image_token_index is the id assigned to the <image> token.
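To connect this to the original question about 576 patches: as far as I can tell, the single placeholder is not attended to directly. Before the language model runs, the placeholder position is replaced by the whole sequence of patch embeddings, so one id in input_ids turns into 576 rows in the embedding sequence. In the original LLaVA repository this seems to happen in `prepare_inputs_labels_for_multimodal`. The snippet below is only a toy sketch of that idea, not the library's code: `splice_image_features`, `toy_text_embed`, the 4-dimensional embeddings, and the example token ids are all made up for illustration, and the value -200 for `IMAGE_TOKEN_INDEX` is taken from the LLaVA constants as I understand them.

```python
# Toy sketch: how one image placeholder can expand into many patch embeddings.
# All helper names and sizes here are illustrative assumptions, not LLaVA's API.

IMAGE_TOKEN_INDEX = -200          # placeholder id (assumed from LLaVA's constants)
IMAGE_SIZE, PATCH_SIZE = 336, 14  # default CLIP ViT-L/14 at 336 px
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2  # 24 * 24 = 576


def splice_image_features(input_ids, text_embed, image_features):
    """Replace each placeholder id with the full list of patch embeddings."""
    output = []
    for token_id in input_ids:
        if token_id == IMAGE_TOKEN_INDEX:
            # One placeholder position becomes NUM_PATCHES embedding rows.
            output.extend(image_features)
        else:
            output.append(text_embed(token_id))
    return output


def toy_text_embed(token_id):
    """Hypothetical stand-in for the language model's embedding lookup."""
    return [float(token_id)] * 4


# Hypothetical projector output: 576 rows of 4-dimensional features.
toy_image_features = [[0.0] * 4 for _ in range(NUM_PATCHES)]

# Two text ids, the image placeholder, then two more text ids (ids are made up).
input_ids = [1, 3148, IMAGE_TOKEN_INDEX, 1724, 29915]
embeds = splice_image_features(input_ids, toy_text_embed, toy_image_features)

print(len(input_ids), "->", len(embeds))
```

The sequence grows from 5 positions to 4 + 576, which is why only a single IMAGE_TOKEN_INDEX needs to appear in input_ids. Note that some newer versions of the Hugging Face processor may instead expand `<image>` into repeated placeholder tokens at tokenization time, so the exact mechanism depends on the implementation and version you are reading.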
