The final input to the LLM is still a continuous feature representation for i2t #14
Comments
The input for visual understanding is obtained by quantizing the continuous features. Since the representations are drawn from a fixed codebook, it functions as a discrete tokenizer. This discretization through VQ ensures we work with discrete tokens rather than continuous features.
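To make that quantization step concrete, here is a minimal sketch, not the repository's actual code; the tensor shapes, the 8192-entry codebook, and the function name are illustrative assumptions. Each continuous encoder feature is snapped to its nearest codebook entry, so the image ends up represented by discrete indices into a fixed codebook:

```python
import torch

def vector_quantize(features: torch.Tensor, codebook: torch.Tensor):
    """features: (N, D) continuous encoder outputs; codebook: (K, D) fixed entries."""
    # Pairwise L2 distances between each feature and every codebook entry.
    distances = torch.cdist(features, codebook)   # (N, K)
    indices = distances.argmin(dim=-1)            # (N,) discrete token ids
    quantized = codebook[indices]                 # (N, D) features snapped to the codebook
    return indices, quantized

codebook = torch.randn(8192, 256)   # hypothetical 8192-entry codebook, 256-dim entries
features = torch.randn(576, 256)    # hypothetical 576 patch features from the vision encoder
ids, quantized = vector_quantize(features, codebook)
```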
However, you are still using the LLaVA architecture with an adapter, which is notably different from purely discrete methods like Emu3 and Chameleon. Am I right?
Both the input and output are discrete in this approach. The linear adapter is just one way to process these discrete signals.
The key difference lies in how the input image embedding is obtained. While you re-use the codebook entry for this purpose, Emu3 and Chameleon do not follow this method. Instead, they feed discrete tokens directly into the embedding layers, the same way the text modality is handled.
This work focuses on the tokenizer itself. We chose the LLaVA framework, which is cost-effective and easy to reproduce, to verify the visual understanding ability. For unified modeling, you can of course also use our tokenizer the way Chameleon and Emu3 do: exactly the same discrete scheme, with newly initialized embeddings. This is fully supported.
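For illustration, a minimal sketch of the two input paths discussed above; the dimensions and variable names are hypothetical and this is not the repository's actual code:

```python
import torch
import torch.nn as nn

vocab_size, code_dim, llm_dim = 8192, 256, 4096   # hypothetical sizes
ids = torch.randint(0, vocab_size, (576,))        # discrete VQ indices for one image
codebook = torch.randn(vocab_size, code_dim)      # the tokenizer's fixed codebook

# (a) LLaVA-style: re-use the codebook entry and project it with a linear adapter.
adapter = nn.Linear(code_dim, llm_dim)
inputs_llava_style = adapter(codebook[ids])       # (576, llm_dim) sequence fed to the LLM

# (b) Chameleon/Emu3-style: treat the indices like text tokens and look them up
#     in a newly initialized embedding table trained together with the LLM.
image_embedding = nn.Embedding(vocab_size, llm_dim)
inputs_discrete_style = image_embedding(ids)      # (576, llm_dim) sequence fed to the LLM
```

In both cases the LLM only ever sees information determined by the discrete indices; the difference is whether the per-token embedding comes from the frozen codebook plus an adapter or from a freshly trained lookup table.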
Hi, it seems that for visual understanding, although the model uses VQ to discretize the encoding of images, the final input to the LLM is still a continuous feature representation. I doubt whether it can still be called a discrete tokenizer.