Incorrect token id for <|image|> token #219
Comments
@vancoykendall you have identified a most terrible wart in the llama3 vision model. See https://github.com/meta-llama/llama-models/blob/main/models/llama3/api/chat_format.py#L226 Essentially, the <|image|> token does correspond to 128011 in the tokenizer. However, the special token that actually got trained is the last token, 128256. The reason that happened is a very esoteric detail of our training process.
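To make the mismatch concrete, here is a minimal sketch of swapping the tokenizer's <|image|> id for the id that appears to have been trained, before the ids reach the model. The two ids are the ones reported in this thread (128011 from the tokenizer, 128256 from the issue). The helper name remap_image_token is hypothetical, and this is not a description of what chat_format.py actually does.

```python
from typing import List

# Ids as reported in this thread; verify them against your own checkpoint.
TOKENIZER_IMAGE_ID = 128011  # id the llama-models tokenizer assigns to <|image|>
TRAINED_IMAGE_ID = 128256    # id whose embedding appears to have been trained


def remap_image_token(token_ids: List[int]) -> List[int]:
    """Return a copy of token_ids with the tokenizer's <|image|> id
    replaced by the trained id (hypothetical helper, for illustration)."""
    return [TRAINED_IMAGE_ID if t == TOKENIZER_IMAGE_ID else t for t in token_ids]
```

All other ids pass through unchanged, so the helper is safe to apply to a sequence that contains no image token.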
@ashwinb Gotcha, but shouldn't the downloaded checkpoint from meta contain the token embedding for the |
In this repo the Llama3 tokenizer sets the <|image|> special token to 128011:
llama-models/models/llama3/api/tokenizer.py
Lines 79 to 101 in ec6b563
However, in the tokenizer_config.json uploaded to the Hugging Face repo meta-llama/Llama-3.2-11B-Vision-Instruct, the <|image|> token is mapped to 128256.
I also checked the norms of the model's embedding layer for tokens 128011 and 128256. Token 128011 has a norm near zero, while token 128256 has a regular norm. This makes me think 128256 is the correct token embedding for the <|image|> token.
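The norm comparison described above can be sketched as follows. This is a plain-Python illustration with a toy table standing in for the real embedding matrix. The function name embedding_norms and the toy values are assumptions made for the example, not the code used in the issue.

```python
import math
from typing import Dict, Iterable, Sequence


def embedding_norms(
    embedding: Sequence[Sequence[float]], token_ids: Iterable[int]
) -> Dict[int, float]:
    """Return the L2 norm of the embedding row for each requested token id."""
    return {
        tid: math.sqrt(sum(x * x for x in embedding[tid]))
        for tid in token_ids
    }


# Toy 4-row table: row 2 mimics an untrained (near-zero) embedding,
# row 3 mimics a trained one.
toy = [[0.0, 0.0], [3.0, 4.0], [1e-6, 0.0], [0.6, 0.8]]
norms = embedding_norms(toy, [2, 3])
```

With a real checkpoint, the same check would be run on rows 128011 and 128256 of the model's input embedding weight, for example after converting those rows to Python lists. A near-zero norm suggests the row was never updated during training.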