
Problem in Reproducing MMMU benchmark on llava-v1.5-7b #433

Open
lyklly opened this issue Nov 29, 2024 · 8 comments

Comments

@lyklly

lyklly commented Nov 29, 2024

Title: I have a problem reproducing the MMMU benchmark
Description:
I followed the guide to reproduce your MMMU result with lmms-eval:

git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
pip install -e .

What's more, I also followed the guide in miscs/repr_scripts.sh. However, I get 36.22 on the val split, while the result you report in the results spreadsheet is 35.3. The screenshot is below.

[screenshot: my lmms-eval output showing an MMMU val score of 36.22]

While your reported result is:

[screenshot: the reported MMMU val score of 35.3]
Is there any problem?
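
For reference, the kind of command miscs/repr_scripts.sh runs looks roughly like this (a sketch based on common lmms-eval usage; the task name mmmu_val and the exact flags are assumptions, not copied from the script):

# assumed single-process lmms-eval run of llava-v1.5-7b on the MMMU validation split
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks mmmu_val \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/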

@Ivesfu

Ivesfu commented Dec 21, 2024

Same situation here: my result is around 36. Any quick fix?

@lyklly

lyklly commented Dec 21, 2024

> Same situation here: my result is around 36. Any quick fix?

Maybe you can try VLMEvalKit; I used it to reproduce 35.1.
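
For what it's worth, a VLMEvalKit run for this setup looks roughly like this (a sketch; the model name llava_v1.5_7b and the dataset name MMMU_DEV_VAL follow VLMEvalKit's naming but should be checked against its README):

# assumed VLMEvalKit invocation for llava-v1.5-7b on MMMU dev/val
git clone https://github.com/open-compass/VLMEvalKit
cd VLMEvalKit
pip install -e .
python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose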

@Ivesfu

Ivesfu commented Dec 21, 2024

> > Same situation here: my result is around 36. Any quick fix?
>
> Maybe you can try VLMEvalKit; I used it to reproduce 35.1.

Thank you for your quick response!

By the way, have you come across any other mismatched results with lmms-eval?

Actually, I've encountered a few mismatched results with lmms-eval, some minor and some more significant. However, I'm not sure to what extent these discrepancies are acceptable and which ones might indicate a real issue. Could you please give me some guidance on this? As a beginner, I would greatly appreciate any advice you could offer.

Thanks again!

@lyklly

lyklly commented Dec 21, 2024

I remember testing mathvista and llava_w (LLaVA-in-the-Wild), and the results were consistent. However, I use VLMEvalKit the most because many papers mention using it. I have only been studying VLMs for a few months, so I am also a beginner, haha.

@Ivesfu

Ivesfu commented Dec 21, 2024

> I remember testing mathvista and llava_w (LLaVA-in-the-Wild), and the results were consistent. However, I use VLMEvalKit the most because many papers mention using it. I have only been studying VLMs for a few months, so I am also a beginner, haha.

Thanks a lot! I'll check it~

@lyklly

lyklly commented Dec 21, 2024

Also, I think you should pay attention to whether the evaluation method in a paper is the model's official code, VLMEvalKit, or lmms-eval, and try to use the same one as the paper. Generally speaking, you can reproduce almost the same results with the latter two tools. The official results and the results from these two tools are not completely consistent, usually differing by a few tenths of a percent, but in many cases the improvements claimed by new VLM methods are not much larger than that. In summary, I think you still need to find a way to reproduce the same results; you can look into the transformers version, the GPT version, the CUDA version, and the torch version.
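
For example, one quick way to record the relevant versions alongside a run (the specific packages to check are only a suggestion):

# print the framework and CUDA versions used for the run
python -c "import torch, transformers; print('torch', torch.__version__, 'cuda', torch.version.cuda); print('transformers', transformers.__version__)"
nvcc --version
pip freeze | grep -i -E "lmms|llava"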


@Ivesfu

Ivesfu commented Dec 21, 2024

> Also, I think you should pay attention to whether the evaluation method in a paper is the model's official code, VLMEvalKit, or lmms-eval, and try to use the same one as the paper. Generally speaking, you can reproduce almost the same results with the latter two tools. The official results and the results from these two tools are not completely consistent, usually differing by a few tenths of a percent, but in many cases the improvements claimed by new VLM methods are not much larger than that. In summary, I think you still need to find a way to reproduce the same results; you can look into the transformers version, the GPT version, the CUDA version, and the torch version.

I think that when the difference is relatively small, applying the same testing method (the so-called official code, or a general evaluation tool like lmms-eval) to both your method/model and the baselines should be considered a reasonably acceptable approach. :)

And I take your point about the transformers, GPT, CUDA, and torch versions, but unfortunately some of them are sometimes hard to align exactly. :(

Best,

@lyklly

lyklly commented Dec 21, 2024

Yes, I think that when the difference is small, using the same method is acceptable, because a lot of papers do this. And I think many differences are caused by the GPT version; for example, gpt-4-vision-preview is deprecated.
