The final input to the LLM is still a continuous feature representation for i2t #14

tianzhangwu opened this issue Jan 6, 2025 · 5 comments

@tianzhangwu

Hi, it seems that for visual understanding, although the model uses VQ to discretize the image encodings, the final input to the LLM is still a continuous feature representation.

I doubt whether it can still be called a discrete tokenizer.

@QuLiao1117 (Contributor)

The input for visual understanding is obtained by quantizing the continuous features: since the representations are drawn from a fixed codebook, it functions as a discrete tokenizer. The VQ discretization ensures the LLM works with discrete tokens rather than raw continuous features.
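
For concreteness, the quantization step can be sketched roughly as follows (illustrative PyTorch with assumed names and shapes, not the repository's actual code):

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map continuous features (N, D) to the indices of their nearest codebook entries."""
    dists = torch.cdist(features, codebook)   # (N, K) distances to every codebook vector
    indices = dists.argmin(dim=-1)            # (N,) discrete token ids
    quantized = codebook[indices]             # (N, D) the selected codebook entries
    return indices, quantized

# Hypothetical sizes: an 8192-entry codebook of 256-dim codes, 576 image patches.
codebook = torch.randn(8192, 256)
features = torch.randn(576, 256)
ids, z_q = quantize(features, codebook)
```

The LLM input is then built from `ids` (equivalently, from the codebook entries they select), so everything passed downstream is determined by discrete indices.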

@tianzhangwu (Author)

However, you are still using the LLaVA architecture with an adapter, which is notably different from purely discrete methods like Emu3 and Chameleon.

Am I right?

@QuLiao1117 (Contributor) commented Jan 6, 2025

Both the input and output are discrete in this approach. The linear adapter is just one way to process these discrete signals.
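
As an illustration (an assumed setup, not the actual implementation): the discrete ids select frozen codebook entries, and the linear adapter only projects those entries into the LLM's input space.

```python
import torch
import torch.nn as nn

code_dim, llm_dim = 256, 4096             # assumed dimensions
codebook = torch.randn(8192, code_dim)    # frozen tokenizer codebook
adapter = nn.Linear(code_dim, llm_dim)    # hypothetical linear adapter

def ids_to_llm_inputs(ids: torch.Tensor) -> torch.Tensor:
    z_q = codebook[ids]       # everything downstream depends only on the discrete ids
    return adapter(z_q)       # project the selected entries into the LLM embedding space

inputs = ids_to_llm_inputs(torch.randint(0, 8192, (576,)))   # (576, llm_dim)
```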

@tianzhangwu (Author)

The key difference lies in how the input image embeddings are obtained. While you re-use the codebook entries for this purpose, Emu3 and Chameleon do not follow this method: they feed the discrete tokens directly into the embedding layer, just as is done for the text modality.
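
For comparison, that pathway looks roughly like the sketch below (assumed sizes, not Emu3's or Chameleon's actual code): the image ids index a newly initialized, trainable embedding table, exactly like text ids.

```python
import torch
import torch.nn as nn

image_vocab, llm_dim = 8192, 4096                  # assumed sizes
image_embed = nn.Embedding(image_vocab, llm_dim)   # newly initialized, trained with the LLM

image_ids = torch.randint(0, image_vocab, (576,))  # discrete image tokens
inputs = image_embed(image_ids)                    # (576, llm_dim), same path as text tokens
```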

@muxizju (Contributor) commented Jan 6, 2025

- This work focuses on the tokenizer itself. We chose the LLaVA framework, which is cost-effective and easy to reproduce, to verify the visual understanding ability. If the tokenizer is used for unified modeling, you can of course also use it exactly the way Chameleon and Emu3 do, fully discretely with newly initialized embeddings; this is fully supported.
- It is also possible to follow LLaVA's example, keep a separate codebook, and train an MLP on top of the pre-trained LLM, which is less costly. This approach does not change the essence of discretization: we still use a cross-entropy loss to predict the image token index instead of regressing continuous features (see the sketch after this list).
- The contribution lies in the fact that this tokenizer brings strong semantic comprehension ability, even if the tokenizer's codebook is discarded and newly initialized LLM embeddings are used. This can be seen from the results of using VQGAN and TokenFlow with LLaVA in the paper. The key to the semantic information lies in the mapping relationship: we have run experiments in which a completely newly initialized, learnable codebook is used for the comprehension tasks, and the metrics show little difference.
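
A minimal sketch of that objective (assumed shapes; the logits are a stand-in for the LLM's output head): whichever way the image ids are embedded at the input, training is a cross-entropy over the discrete image token indices, not a regression onto continuous features.

```python
import torch
import torch.nn.functional as F

image_vocab = 8192                                  # assumed codebook size
target_ids = torch.randint(0, image_vocab, (576,))  # ground-truth image token ids
logits = torch.randn(576, image_vocab)              # stand-in for the LLM head's output

# Discrete next-token prediction over image indices, just as for text tokens.
loss = F.cross_entropy(logits, target_ids)
```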
