
Problem in Reproducing MMMU benchmark on llava-v1.5-7b #433

Open
lyklly opened this issue Nov 29, 2024 · 8 comments

Comments

@lyklly

lyklly commented Nov 29, 2024

Title: I have a problem reproducing the MMMU benchmark
Description:
I followed the guide to reproduce your MMMU result with lmms-eval:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
pip install -e .

What's more, I also followed the guide in miscs/repr_scripts.sh. However, I get 36.22 on the val split, while the result you report in the results spreadsheet is 35.3. The screenshot is below.

[screenshot: my lmms-eval output showing an MMMU val score of 36.22]

While your reported result is:

[screenshot: the reported MMMU val score of 35.3]
Is there any problem?
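
For reference, the kind of command miscs/repr_scripts.sh runs looks roughly like this (a sketch based on common lmms-eval usage; the task name mmmu_val and the exact flags are assumptions, not copied from the script):

# assumed single-process lmms-eval run of llava-v1.5-7b on the MMMU validation split
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mmmu_val \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/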

@Ivesfu

Ivesfu commented Dec 21, 2024

Same situation here: my result is around 36. Any quick fix?

@lyklly

lyklly commented Dec 21, 2024

> Same situation here: my result is around 36. Any quick fix?

Maybe you can try VLMEvalKit; I used it to reproduce 35.1.
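
For what it's worth, a VLMEvalKit run for this setup looks roughly like this (a sketch; the model name llava_v1.5_7b and the dataset name MMMU_DEV_VAL follow VLMEvalKit's naming but should be checked against its README):

# assumed VLMEvalKit invocation for llava-v1.5-7b on MMMU dev/val
git clone https://github.com/open-compass/VLMEvalKit
cd VLMEvalKit
pip install -e .
python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose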

@Ivesfu

Ivesfu commented Dec 21, 2024

> > Same situation here: my result is around 36. Any quick fix?
>
> Maybe you can try VLMEvalKit; I used it to reproduce 35.1.

Thank you for your quick response!

By the way, have you come across any other mismatched results with lmms-eval?

Actually, I've encountered a few mismatched results with lmms-eval, some minor and some more significant. However, I'm not sure to what extent these discrepancies are acceptable and which ones might indicate a real issue. Could you please give me some guidance on this? As a beginner, I would greatly appreciate any advice you could offer.

Thanks again!

@lyklly

lyklly commented Dec 21, 2024

I remember testing mathvista and llava_w (LLaVA-in-the-Wild), and the results were consistent. However, I use VLMEvalKit the most because many papers mention using it. I have only been studying VLMs for a few months, so I am also a beginner, haha.

@Ivesfu

Ivesfu commented Dec 21, 2024

> I remember testing mathvista and llava_w (LLaVA-in-the-Wild), and the results were consistent. However, I use VLMEvalKit the most because many papers mention using it. I have only been studying VLMs for a few months, so I am also a beginner, haha.

Thanks a lot! I'll check it~

@lyklly

lyklly commented Dec 21, 2024

Also, I think you should pay attention to whether the evaluation method in a paper is the model's official code, VLMEvalKit, or lmms-eval, and try to use the same one as the paper. Generally speaking, you can reproduce almost the same results with the latter two tools. The official results and the results from these two tools are not completely consistent, usually differing by a few tenths of a percent, but in many cases the improvements claimed by new VLM methods are not much larger than that. In summary, I think you still need to find a way to reproduce the same results; you can look into the transformers version, the GPT version, the CUDA version, and the torch version.
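
For example, one quick way to record the relevant versions alongside a run (the specific packages to check are only a suggestion):

# print the framework and CUDA versions used for the run
python -c "import torch, transformers; print('torch', torch.__version__, 'cuda', torch.version.cuda); print('transformers', transformers.__version__)"
nvcc --version
pip freeze | grep -i -E "lmms|llava"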


@Ivesfu

Ivesfu commented Dec 21, 2024

> Also, I think you should pay attention to whether the evaluation method in a paper is the model's official code, VLMEvalKit, or lmms-eval, and try to use the same one as the paper. Generally speaking, you can reproduce almost the same results with the latter two tools. The official results and the results from these two tools are not completely consistent, usually differing by a few tenths of a percent, but in many cases the improvements claimed by new VLM methods are not much larger than that. In summary, I think you still need to find a way to reproduce the same results; you can look into the transformers version, the GPT version, the CUDA version, and the torch version.

I think that when the difference is relatively small, applying the same testing method (the so-called official code, or a general evaluation tool like lmms-eval) to both your method/model and the baselines should be considered a reasonably acceptable approach. :)

And I take your point about the transformers, GPT, CUDA, and torch versions, but unfortunately some of them are sometimes hard to align exactly. :(

Best,

@lyklly

lyklly commented Dec 21, 2024

Yes, I think that when the difference is small, using the same method is acceptable, because a lot of papers do this. And I think many differences are caused by the GPT version; for example, gpt-4-vision-preview is deprecated.
