Llama3 tokenizer missing token ID 128011 #1995
I'm getting an error when decoding with the Llama3 tokenizer - we seem to be missing special token ID 128011. It's defined here for the Llama3.1 tokenizer, and here for the Llama3.2 vision tokenizer, as `reserved_special_token_3` and `reserved_special_token_2` respectively, but when we set up the reserved tokens in our Llama3 tokenizer, we skip 128011. Should we be adding this token ID to the reserved special tokens list? This occurs for the Llama3, Llama3.1, and Llama3.2 tokenizers.

cc @RdoubleA
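A minimal sketch of the failure mode described above, assuming a tiktoken-backed tokenizer built with torchtune's `llama3_tokenizer`; the path is a placeholder and the exact exception type may differ:

```python
from torchtune.models.llama3 import llama3_tokenizer

# Placeholder path to a downloaded Llama3 tokenizer.model file.
tokenizer = llama3_tokenizer("/tmp/llama3/tokenizer.model")

tokenizer.decode([128010])  # OK: 128010 is a registered special token
tokenizer.decode([128011])  # errors: 128011 is neither in the base vocab
                            # nor registered as a reserved special token
```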
There seems to be something strange with the reserved token setup in torchtune/torchtune/models/llama3/_tokenizer.py, lines 28 to 38 at commit 4b6877a.

I raised a related issue here: meta-llama/llama-models#219
The token ID for image should be 128256 for torchtune, as that is what Hugging Face uses, and that is where our users download the models. This was the token ID used when initially training the model on images, but for inference the embedding was actually moved to 128011 (previously a reserved token) so that users wouldn't have to keep embedding vectors for both 128011 and 128256, saving compute/memory at inference time. This may explain why you see 128011 being used when you download directly from Meta.

@SalmanMohammadi, since you are explicitly calling decode on 128011, you are seeing this error. But do you hit this token ID in a real use case? I would imagine the tokenizer never outputs 128011, since it is not used for anything, so you would never have to decode it. Still, even if we assume you can run into a wild 128011, I would imagine you should still get a random embedding vector, since the embedding table is contiguous. Perhaps tiktoken refuses to decode an ID that is not registered as a special token and is not in the regular vocab. It may be that our reserved token logic is incorrect and should create a reserved token for 128011. I'll take a closer look at this and the HF tokenizer config to see if that's the case. Tagging some folks who have more knowledge on this topic to clarify anything I may have missed: @pbontrager @abhimanyudubey
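To make the relocation trick concrete, here is an illustrative sketch (not Meta's actual export code; the toy embedding dim and the exact IDs follow the discussion above) of moving a trained embedding row from 128256 into the unused reserved slot 128011 and truncating the table:

```python
import torch

# Toy sizes: real Llama3 uses a much larger embedding dim.
train_vocab_size, dim = 128257, 8  # IDs 0..128255 plus <|image|> at 128256
emb = torch.nn.Embedding(train_vocab_size, dim)

with torch.no_grad():
    # Relocate the trained <|image|> row into the unused reserved slot...
    emb.weight[128011] = emb.weight[128256]

# ...then drop the final row so the inference table ends at ID 128255,
# avoiding an extra row that would duplicate the same vector.
inference_weight = emb.weight[:128256].detach().clone()
```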
If you set

FWIW the HF config for the 3.2 vision tokenizer reserves token

@RdoubleA The problem with 128011 being used in the
Confirmed with @ashwinb on the llama-models repo that 128256 is the correct ID and 128011 should not be used, for the reasons explained in the linked issue. I am not sure what HF is doing on their end, but 128011 should not be a reserved token, nor should it ever come up in the wild, as its embedding is not trained. So @SalmanMohammadi's original error is indeed expected behavior. Why the Eleuther eval recipe test produced a 128011 when using our own tokenizer is a separate question and needs to be debugged. To give future users with similar questions some clarity, I can add comments in the code explaining why 128011 is skipped.
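For illustration, the in-code note proposed here might look like the following; the dict name, offsets, and range arithmetic are assumptions for the sketch, not the actual torchtune source:

```python
NUM_RESERVED_SPECIAL_TOKENS = 256  # assumed count, for illustration only

# NOTE: ID 128011 is intentionally absent from the reserved tokens.
# The <|image|> token was trained with ID 128256; Meta relocates its
# embedding into slot 128011 only in inference-time checkpoints, so the
# tokenizer itself should never emit (or need to decode) 128011.
RESERVED_TOKENS = {
    f"<|reserved_special_token_{3 + i}|>": 128012 + i
    for i in range(NUM_RESERVED_SPECIAL_TOKENS - 12)
}
```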
Thanks @vancoykendall for helping look into this and @RdoubleA for the clear explanation. I think this occurred because some of our recipe tests use an untrained model with dummy weights, which can generate this token. Closing this for now, as that is a separate issue.
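As a side note on why an untrained model can emit such an ID, here is a minimal sketch; the vocab size is an assumption that includes the padded image slot:

```python
import torch

torch.manual_seed(0)
vocab_size = 128257  # assumed: IDs 0..128255 plus the image slot

# An untrained LM head produces essentially random logits over the full
# vocab, so sampling can return any ID, including ones (like 128011) that
# a trained model never emits and the tokenizer cannot decode.
logits = torch.randn(vocab_size)
token_id = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1).item()
print(token_id)  # roughly uniform over [0, 128256]
```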