Naive question on IMAGE_TOKEN_INDEX and Number of Patches #1032
Unanswered
eiespambox asked this question in Q&A
Replies: 1 comment
-
I think the Consider the input prompt in the following example code [source]:
In this case, the |
-
Given the default settings, I think there are 576 image patches that are processed by the CLIP encoder and multimodal projector. This generates a `batch x 576 x 1024` matrix output from the multimodal projector.

However, the code seems to have only one `IMAGE_TOKEN_INDEX` in the `input_ids`. I can't seem to find the code snippet that aligns all the patch embeddings with that single input id that represents the image. Can someone help me understand this?

My naive understanding expected there to be 576 `IMAGE_TOKEN_INDEX` entries prepended to the prompt so that the following attention layers can attend to the different patches.

Thanks!
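
For anyone landing here with the same question: as far as I can tell, the expansion happens at the embedding level rather than in `input_ids`. The repo's `prepare_inputs_labels_for_multimodal` (in `llava/model/llava_arch.py`) appears to split the token sequence at the placeholder, embed the text pieces, and concatenate the per-patch image features in between, so the sequence the attention layers see is longer than `input_ids`. Below is a toy, pure-Python sketch of that idea, not the repo's actual code. The placeholder value `-200`, the 336-pixel image size, and the 14-pixel patch size are assumptions about the defaults, and `splice_image_features`, `num_patches`, and `embed_token` are hypothetical names made up for illustration.

```python
# Placeholder id for the image; assumed to match the value in llava/constants.py.
IMAGE_TOKEN_INDEX = -200


def num_patches(image_size=336, patch_size=14):
    """Patch count for a ViT with square, non-overlapping patches (assumed defaults)."""
    side = image_size // patch_size
    return side * side


def splice_image_features(input_ids, embed_token, image_features):
    """Build the embedding sequence for one prompt.

    Every ordinary token id becomes one embedding vector; the single
    IMAGE_TOKEN_INDEX placeholder is replaced by ALL per-patch vectors.
    """
    embeds = []
    for tok in input_ids:
        if tok == IMAGE_TOKEN_INDEX:
            embeds.extend(image_features)   # one placeholder -> many positions
        else:
            embeds.append(embed_token(tok))
    return embeds


hidden = 4                      # toy hidden size, far smaller than a real model's
patches = num_patches()         # (336 // 14) ** 2 = 576
# Stand-in for the projector output of one image: `patches` vectors of size `hidden`.
image_features = [[float(i)] * hidden for i in range(patches)]


def embed_token(tok):
    """Hypothetical stand-in for the language model's embedding table."""
    return [0.0] * hidden


input_ids = [1, IMAGE_TOKEN_INDEX, 319, 13563]   # arbitrary toy token ids
embeds = splice_image_features(input_ids, embed_token, image_features)
print(len(input_ids), "->", len(embeds))         # 4 -> 579
```

Under these assumptions, one placeholder id yields 576 positions in the embedded sequence (3 text tokens + 576 patches = 579), which would explain why only a single `IMAGE_TOKEN_INDEX` shows up in `input_ids`.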