
Confusion about PCA step #16

Closed
2811668688 opened this issue Dec 26, 2024 · 5 comments

Comments

@2811668688

I hope this message finds you well. I have a couple of questions regarding your PCA methodology.

(1) When performing PCA, did you apply any additional processing steps to the embedding data before conducting PCA?

(2) Regarding the elimination of the modality gap, is it possible to obtain the image Fig 3(b) by simply using the "one word:" method, or is further fine-tuning required to achieve this outcome?

Thank you very much for your time.

@kongds
Owner

kongds commented Dec 26, 2024

Thank you for your interest in our work.
1. We did not apply any additional processing steps.
2. The results were obtained simply by using the prompt, without fine-tuning.
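A minimal sketch of the visualization discussed here (not the authors' code): pooled image and text embeddings are stacked, mean-centered, and projected onto the first two principal components, with no further processing, so any modality gap shows up as two separated clusters. The embeddings below are random stand-ins.

```python
import numpy as np

# Stand-in pooled embeddings for the two modalities (random, for illustration only).
rng = np.random.default_rng(0)
image_emb = rng.normal(0.0, 1.0, size=(100, 512))  # 100 image embeddings
text_emb = rng.normal(0.5, 1.0, size=(100, 512))   # 100 caption embeddings

# Stack both modalities and center; no additional processing before PCA.
X = np.vstack([image_emb, text_emb])
X_centered = X - X.mean(axis=0)

# PCA via SVD of the centered matrix; project onto the first two components.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pcs = X_centered @ Vt[:2].T

image_2d, text_2d = pcs[:100], pcs[100:]  # points to scatter-plot per modality
print(image_2d.shape, text_2d.shape)
```

The two 2-D point sets can then be scatter-plotted in different colors to reproduce a figure in the style of Fig 3(b).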

For more information, please refer to #13 (comment).

@2811668688
Author

Thanks so much for your reply.

I still have some points of confusion I'd like to discuss with you. The paper mentions that if the "in one word:" method is not used, a modality gap appears in the plotted graph. What prompt is used when it is not employed?

Besides, I tried the "in one word:" method on other MLLMs. Although the output texts for both modalities are similar words, a modality gap still appears when I extract the hidden states for plotting. However, when I directly evaluate Recall with the embeddings of the two modalities, the results are good. Have you tried other MLLMs with this prompt in your experiments, and can it also eliminate the modality gap there? I have tried many prompts, but the modality gap persists.

Thank you very much for your time, and best wishes.

@kongds
Owner

kongds commented Dec 30, 2024

For the method without the prompt, we use the same chat template as Llama 3, like the following:

template = '<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n \n'
template.format('<image>') # image
template.format('<sent>') # sentence

We have not plotted the gap for other MLLMs. Could you share the MLLMs and prompts you used?

@2811668688
Author

Thanks for your reply! I used a prompt similar to yours and solved my problem today.

The main issue for me lay in the one-word summary. I found that for multimodal data and their corresponding captions, my MLLM summarized them as verbs and nouns respectively, and the semantic distance between the two is quite large. After adding in-context learning to the sentence-summary step, I obtained a graph similar to yours.

I think that when using this prompt for semantic compression, in-context learning might also be necessary so that more MLLMs can reproduce the results described in your paper. Looking forward to further communication with you.
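The in-context learning idea above can be sketched as follows. This is a hypothetical illustration, not code from the paper: a few invented example pairs are prepended to a PromptEOL-style "in one word:" prompt so the model is steered toward noun summaries for both modalities.

```python
# Hypothetical sketch: steer the one-word summary toward nouns via few-shot examples.
def build_icl_prompt(sentence, examples):
    """Build a few-shot 'in one word:' prompt from (sentence, one-word) pairs."""
    shots = ''.join(
        f'This sentence: "{s}" means in one word: "{w}".\n' for s, w in examples
    )
    # Leave the final answer open so the model completes it with one word.
    return shots + f'This sentence: "{sentence}" means in one word: "'

# Invented demonstrations whose answers are nouns (e.g. "horse" rather than "riding").
examples = [
    ('A man rides a horse on the beach.', 'horse'),
    ('Two dogs play with a frisbee.', 'frisbee'),
]
prompt = build_icl_prompt('A child eats ice cream at the fair.', examples)
print(prompt)
```

The same demonstrations would be prepended to the image-side prompt as well, so both modalities are pushed toward the same part of speech before the hidden states are extracted.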

@kongds
Owner

kongds commented Dec 30, 2024

Thank you for your advice.

We also found that in-context learning helps with this prompt in PromptEOL, where we conducted a more in-depth analysis of in-context learning, focusing on how to select examples and how it scales with model size.
However, we haven't tried this for MLLMs.

@kongds kongds closed this as completed Feb 7, 2025