The final input to the LLM is still a continuous feature representation for i2t #14

tianzhangwu opened this issue Jan 6, 2025 · 5 comments

@tianzhangwu

Hi, it seems that for visual understanding, although the model uses VQ to discretize the image encodings, the final input to the LLM is still a continuous feature representation.

I doubt whether it can still be called a discrete tokenizer.

@QuLiao1117 (Contributor)

The input for visual understanding is obtained by quantizing the continuous features: since the representations are drawn from a fixed codebook, it functions as a discrete tokenizer. The VQ discretization ensures the LLM works with discrete tokens rather than raw continuous features.
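
For concreteness, the quantization step can be sketched roughly as follows (illustrative PyTorch with assumed names and shapes, not the repository's actual code):

```python
import torch

def quantize(features: torch.Tensor, codebook: torch.Tensor):
    """Map continuous features (N, D) to the indices of their nearest codebook entries."""
    dists = torch.cdist(features, codebook)   # (N, K) distances to every codebook vector
    indices = dists.argmin(dim=-1)            # (N,) discrete token ids
    quantized = codebook[indices]             # (N, D) the selected codebook entries
    return indices, quantized

# Hypothetical sizes: an 8192-entry codebook of 256-dim codes, 576 image patches.
codebook = torch.randn(8192, 256)
features = torch.randn(576, 256)
ids, z_q = quantize(features, codebook)
```

The LLM input is then built from `ids` (equivalently, from the codebook entries they select), so everything passed downstream is determined by discrete indices.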

@tianzhangwu (Author)

However, you are still using the LLaVA architecture with an adapter, which is notably different from purely discrete methods like Emu3 and Chameleon.

Am I right?

@QuLiao1117 (Contributor) commented Jan 6, 2025

Both the input and output are discrete in this approach. The linear adapter is just one way to process these discrete signals.
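
As an illustration (an assumed setup, not the actual implementation): the discrete ids select frozen codebook entries, and the linear adapter only projects those entries into the LLM's input space.

```python
import torch
import torch.nn as nn

code_dim, llm_dim = 256, 4096             # assumed dimensions
codebook = torch.randn(8192, code_dim)    # frozen tokenizer codebook
adapter = nn.Linear(code_dim, llm_dim)    # hypothetical linear adapter

def ids_to_llm_inputs(ids: torch.Tensor) -> torch.Tensor:
    z_q = codebook[ids]       # everything downstream depends only on the discrete ids
    return adapter(z_q)       # project the selected entries into the LLM embedding space

inputs = ids_to_llm_inputs(torch.randint(0, 8192, (576,)))   # (576, llm_dim)
```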

@tianzhangwu (Author)

The key difference lies in how the input image embeddings are obtained. While you re-use the codebook entries for this purpose, Emu3 and Chameleon do not follow this method: they feed the discrete tokens directly into the embedding layer, just as is done for the text modality.
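
For comparison, that pathway looks roughly like the sketch below (assumed sizes, not Emu3's or Chameleon's actual code): the image ids index a newly initialized, trainable embedding table, exactly like text ids.

```python
import torch
import torch.nn as nn

image_vocab, llm_dim = 8192, 4096                  # assumed sizes
image_embed = nn.Embedding(image_vocab, llm_dim)   # newly initialized, trained with the LLM

image_ids = torch.randint(0, image_vocab, (576,))  # discrete image tokens
inputs = image_embed(image_ids)                    # (576, llm_dim), same path as text tokens
```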

@muxizju (Contributor) commented Jan 6, 2025

- This work focuses on the tokenizer itself. We chose the LLaVA framework, which is cost-effective and easy to reproduce, to verify the visual understanding ability. If the tokenizer is used for unified modeling, you can of course also use it exactly the way Chameleon and Emu3 do, fully discretely with newly initialized embeddings; this is fully supported.
- It is also possible to follow LLaVA's example, keep a separate codebook, and train an MLP on top of the pre-trained LLM, which is less costly. This approach does not change the essence of discretization: we still use a cross-entropy loss to predict the image token index instead of regressing continuous features (see the sketch after this list).
- The contribution lies in the fact that this tokenizer brings strong semantic comprehension ability, even if the tokenizer's codebook is discarded and newly initialized LLM embeddings are used. This can be seen from the results of using VQGAN and TokenFlow with LLaVA in the paper. The key to the semantic information lies in the mapping relationship: we have run experiments in which a completely newly initialized, learnable codebook is used for the comprehension tasks, and the metrics show little difference.
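
A minimal sketch of that objective (assumed shapes; the logits are a stand-in for the LLM's output head): whichever way the image ids are embedded at the input, training is a cross-entropy over the discrete image token indices, not a regression onto continuous features.

```python
import torch
import torch.nn.functional as F

image_vocab = 8192                                  # assumed codebook size
target_ids = torch.randint(0, image_vocab, (576,))  # ground-truth image token ids
logits = torch.randn(576, image_vocab)              # stand-in for the LLM head's output

# Discrete next-token prediction over image indices, just as for text tokens.
loss = F.cross_entropy(logits, target_ids)
```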
