Problem in Reproducing MMMU benchmark on llava-v1.5-7b #433
Comments
I'm in the same situation; my result is around 36. Is there any quick fix?
Maybe you can try VLMEvalKit; I used it to reproduce 35.1.
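A minimal sketch of how such a run could look, assuming the usual VLMEvalKit entry point and the llava_v1.5_7b / MMMU_DEV_VAL identifiers (these names may differ across versions, so double-check them against the current repository):
# clone and install VLMEvalKit (layout and names may have changed since this thread)
git clone https://github.com/open-compass/VLMEvalKit
cd VLMEvalKit
pip install -e .
# evaluate LLaVA-v1.5-7B on the MMMU dev/val split; identifiers are assumptions
python run.py --data MMMU_DEV_VAL --model llava_v1.5_7b --verbose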
Thank you for your quick response! By the way, have you come across any other mismatched results with lmms-eval? I've encountered a few myself, some minor and some more significant. Thanks again!
I remember testing mathvista and llava_w, and the results were consistent. However, I use VLMEvalKit the most because many papers mention using it. I have only been learning about VLMs for a few months, so I am also a beginner hhh.
Thanks a lot! I'll check it out~
Also, I think you should pay attention to whether the test method in the paper is the official code, VLMEvalKit, or lmms-eval, and try to choose the same one as the paper. Generally speaking, you can reproduce almost the same results with the latter two tools. The official results and the results from these two tools are not completely consistent, typically differing by a few tenths of a percent, yet in many cases the improvements reported for a VLM are not much larger than that, so the choice of tool matters. In summary, I think you still need to find a way to reproduce the same results; you can look at it from the perspective of the transformers version, the GPT version, the CUDA version, and the torch version.
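As a small illustration of that last point (the package list here is just an assumption about what usually matters), recording the environment of each evaluation run makes such comparisons much easier:
# print the core library versions used for the run
python -c "import torch, transformers; print('torch', torch.__version__, 'cuda', torch.version.cuda, 'transformers', transformers.__version__)"
# keep a snapshot of the relevant packages and the GPU driver alongside the results
pip freeze | grep -iE "torch|transformers|openai|accelerate" > eval_env.txt
nvidia-smi --query-gpu=driver_version --format=csv,noheader >> eval_env.txt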
I think that when the difference is relatively small, applying the same testing method (the so-called official code, or a general evaluation tool like lmms-eval) to both your own method/model and the baselines should be considered a reasonably acceptable approach :) And I acknowledge your point about the transformers, GPT, CUDA, and torch versions, but unfortunately some of them are hard to align exactly :( Best,
Yes, I think that when the difference is small, using the same method is acceptable, because a lot of papers do this. I also think many differences are caused by the GPT version, for example gpt-4-vision-preview being deprecated.
Title: I have a problem in reproducing the MMMU benchmark
Description:
I followed the guide to reproduce your MMMU result with lmms-eval:
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip install -e .
git clone https://github.com/haotian-liu/LLaVA
cd LLaVA
pip install -e .
What's more, I also followed the guide in miscs/repr_scripts.sh. However, I get 36.22 on the val split, whereas the result reported in your results spreadsheet is 35.3.
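For reference, a minimal sketch of the evaluation command along the lines of miscs/repr_scripts.sh; the task name mmmu_val, the model name llava, and the model_args are assumptions on my part and may differ between lmms-eval versions:
# evaluate llava-v1.5-7b on the MMMU validation split (names are assumptions; check the current task list)
accelerate launch --num_processes=1 -m lmms_eval \
    --model llava \
    --model_args pretrained=liuhaotian/llava-v1.5-7b \
    --tasks mmmu_val \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/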
Is there any problem with my setup?