
Confusion about PCA step #16

Closed
2811668688 opened this issue Dec 26, 2024 · 5 comments

Comments

@2811668688

I hope this message finds you well. I have a couple of questions regarding your PCA methodology.

(1) When performing PCA, did you apply any additional processing steps to the embedding data before conducting PCA?

(2) Regarding the elimination of the modality gap, is it possible to obtain the image Fig 3(b) by simply using the "one word:" method, or is further fine-tuning required to achieve this outcome?

Thank you very much for your time.

@kongds
Owner

kongds commented Dec 26, 2024

Thank you for your interest in our work.
1. We did not apply any additional processing steps.
2. The results were obtained simply by using the prompt, without fine-tuning.
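A minimal sketch of the visualization discussed here (not the authors' code): pooled image and text embeddings are stacked, mean-centered, and projected onto the first two principal components, with no further processing, so any modality gap shows up as two separated clusters. The embeddings below are random stand-ins.

```python
import numpy as np

# Stand-in pooled embeddings for the two modalities (random, for illustration only).
rng = np.random.default_rng(0)
image_emb = rng.normal(0.0, 1.0, size=(100, 512))  # 100 image embeddings
text_emb = rng.normal(0.5, 1.0, size=(100, 512))   # 100 caption embeddings

# Stack both modalities and center; no additional processing before PCA.
X = np.vstack([image_emb, text_emb])
X_centered = X - X.mean(axis=0)

# PCA via SVD of the centered matrix; project onto the first two components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pcs = X_centered @ Vt[:2].T

image_2d, text_2d = pcs[:100], pcs[100:]  # points to scatter-plot per modality
print(image_2d.shape, text_2d.shape)
```

The two 2-D point sets can then be scatter-plotted in different colors to reproduce a figure in the style of Fig 3(b).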

For more information, please refer to #13 (comment).

@2811668688
Author

Thanks so much for your reply.

I still have some points of confusion I'd like to discuss with you. The paper mentions that if the "in one word:" method is not used, a modality gap appears in the plotted graph. What prompt is used when it is not employed?

Besides, I tried the "in one word:" method on other MLLMs. Although the output texts for both modalities are similar words, a modality gap still appears when I extract the hidden states for plotting. However, when I directly evaluate Recall with the embeddings of the two modalities, the results are good. Have you tried other MLLMs with this prompt in your experiments, and can it also eliminate the modality gap there? I have tried many prompts, but the modality gap persists.

Thank you very much for your time, and best wishes.

@kongds
Owner

kongds commented Dec 30, 2024

For the method without the prompt, we use the same chat template as Llama 3, like the following:

template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n'
template.format('<image>') # image
template.format('<sent>') # sentence

We have not plotted the gap for other MLLMs. Could you share the MLLMs and prompts you used?

@2811668688
Author

Thanks for your reply! I used a prompt similar to yours and solved my problem today.

The main issue for me lay in the one-word summary. I found that for multimodal data and their corresponding captions, my MLLM summarized them as verbs and nouns respectively, and the semantic distance between the two is quite large. After adding in-context learning to the sentence-summary step, I obtained a graph similar to yours.

I think that when using this prompt for semantic compression, in-context learning might also be necessary so that more MLLMs can reproduce the results described in your paper. Looking forward to further communication with you.
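The in-context learning idea above can be sketched as follows. This is a hypothetical illustration, not code from the paper: a few invented example pairs are prepended to a PromptEOL-style "in one word:" prompt so the model is steered toward noun summaries for both modalities.

```python
# Hypothetical sketch: steer the one-word summary toward nouns via few-shot examples.
def build_icl_prompt(sentence, examples):
    """Build a few-shot 'in one word:' prompt from (sentence, one-word) pairs."""
    shots = ''.join(
        f'This sentence: "{s}" means in one word: "{w}".\n' for s, w in examples
    )
    # Leave the final answer open so the model completes it with one word.
    return shots + f'This sentence: "{sentence}" means in one word: "'

# Invented demonstrations whose answers are nouns (e.g. "horse" rather than "riding").
examples = [
    ('A man rides a horse on the beach.', 'horse'),
    ('Two dogs play with a frisbee.', 'frisbee'),
]
prompt = build_icl_prompt('A child eats ice cream at the fair.', examples)
print(prompt)
```

The same demonstrations would be prepended to the image-side prompt as well, so both modalities are pushed toward the same part of speech before the hidden states are extracted.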

@kongds
Owner

kongds commented Dec 30, 2024

Thank you for your advice.

We also found that in-context learning helps with this prompt in PromptEOL, where we conducted a more in-depth analysis of in-context learning, focusing on how to select examples and how it scales with model size.
However, we haven't tried this for MLLMs.

@kongds kongds closed this as completed Feb 7, 2025