
bench: Add a benchmark for vLM: MMMU #3562
Merged: 2 commits merged into sgl-project:main on Feb 22, 2025

Conversation

mickqian (Contributor)

Motivation

Modifications

Checklist

mickqian mentioned this pull request on Feb 14, 2025
mickqian changed the title from "Add a new benchmark for vLM: MMMU" to "Add a benchmark for vLM: MMMU" on Feb 14, 2025
mickqian force-pushed the mmmu branch 2 times, most recently from e5f7884 to 21b3a5b on February 15, 2025 01:41
mickqian force-pushed the mmmu branch 2 times, most recently from 772ce3b to 67f81f1 on February 16, 2025 09:32
mickqian marked this pull request as ready for review on February 16, 2025 13:54
mickqian changed the title from "Add a benchmark for vLM: MMMU" to "bench: Add a benchmark for vLM: MMMU" on Feb 18, 2025
mickqian force-pushed the mmmu branch 3 times, most recently from b88293b to 6d44c37 on February 21, 2025 11:16
yizhang2077 (Collaborator) left a comment:

Hi @mickqian, I have left some comments; all of them are related to paths. I also tried to run bench_hf.py, but it seems to cause an OOM; I am not sure if that is normal.
Here are the qwen2vl and qwen2.5vl results in my environment:
qwen2vl
{"Overall-Art and Design": {"num": 120, "acc": 0.317}, "Art": {"num": 30, "acc": 0.4}, "Art_Theory": {"num": 30, "acc": 0.367}, "Design": {"num": 30, "acc": 0.3}, "Music": {"num": 30, "acc": 0.2}, "Overall-Business": {"num": 150, "acc": 0.32}, "Accounting": {"num": 30, "acc": 0.333}, "Economics": {"num": 30, "acc": 0.3}, "Finance": {"num": 30, "acc": 0.2}, "Manage": {"num": 30, "acc": 0.267}, "Marketing": {"num": 30, "acc": 0.5}, "Overall-Science": {"num": 150, "acc": 0.333}, "Biology": {"num": 30, "acc": 0.367}, "Chemistry": {"num": 30, "acc": 0.167}, "Geography": {"num": 30, "acc": 0.333}, "Math": {"num": 30, "acc": 0.433}, "Physics": {"num": 30, "acc": 0.367}, "Overall-Health and Medicine": {"num": 150, "acc": 0.38}, "Basic_Medical_Science": {"num": 30, "acc": 0.433}, "Clinical_Medicine": {"num": 30, "acc": 0.433}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.133}, "Pharmacy": {"num": 30, "acc": 0.567}, "Public_Health": {"num": 30, "acc": 0.333}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.35}, "History": {"num": 30, "acc": 0.367}, "Literature": {"num": 30, "acc": 0.367}, "Sociology": {"num": 30, "acc": 0.267}, "Psychology": {"num": 30, "acc": 0.4}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.267}, "Agriculture": {"num": 30, "acc": 0.233}, "Architecture_and_Engineering": {"num": 30, "acc": 0.3}, "Computer_Science": {"num": 30, "acc": 0.333}, "Electronics": {"num": 30, "acc": 0.167}, "Energy_and_Power": {"num": 30, "acc": 0.267}, "Materials": {"num": 30, "acc": 0.367}, "Mechanical_Engineering": {"num": 30, "acc": 0.2}, "Overall": {"num": 900, "acc": 0.323}}
qwen2.5vl
{"Overall-Art and Design": {"num": 120, "acc": 0.242}, "Art": {"num": 30, "acc": 0.2}, "Art_Theory": {"num": 30, "acc": 0.267}, "Design": {"num": 30, "acc": 0.3}, "Music": {"num": 30, "acc": 0.2}, "Overall-Business": {"num": 150, "acc": 0.3}, "Accounting": {"num": 30, "acc": 0.467}, "Economics": {"num": 30, "acc": 0.333}, "Finance": {"num": 30, "acc": 0.1}, "Manage": {"num": 30, "acc": 0.233}, "Marketing": {"num": 30, "acc": 0.367}, "Overall-Science": {"num": 150, "acc": 0.2}, "Biology": {"num": 30, "acc": 0.133}, "Chemistry": {"num": 30, "acc": 0.133}, "Geography": {"num": 30, "acc": 0.2}, "Math": {"num": 30, "acc": 0.3}, "Physics": {"num": 30, "acc": 0.233}, "Overall-Health and Medicine": {"num": 150, "acc": 0.267}, "Basic_Medical_Science": {"num": 30, "acc": 0.233}, "Clinical_Medicine": {"num": 30, "acc": 0.167}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.2}, "Pharmacy": {"num": 30, "acc": 0.367}, "Public_Health": {"num": 30, "acc": 0.367}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.242}, "History": {"num": 30, "acc": 0.167}, "Literature": {"num": 30, "acc": 0.333}, "Sociology": {"num": 30, "acc": 0.2}, "Psychology": {"num": 30, "acc": 0.267}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.276}, "Agriculture": {"num": 30, "acc": 0.2}, "Architecture_and_Engineering": {"num": 30, "acc": 0.167}, "Computer_Science": {"num": 30, "acc": 0.233}, "Electronics": {"num": 30, "acc": 0.267}, "Energy_and_Power": {"num": 30, "acc": 0.367}, "Materials": {"num": 30, "acc": 0.433}, "Mechanical_Engineering": {"num": 30, "acc": 0.267}, "Overall": {"num": 900, "acc": 0.257}}

yizhang2077 (Collaborator) left a comment:

I think we can update "How to Support a New vLM" in support_models.md. Each time we add a new VLM, we need to run this benchmark and compare the results with HF.

mickqian (Contributor, Author) replied:

> Hi @mickqian, I have left some comments; all of them are related to paths. I also tried to run bench_hf.py, but it seems to cause an OOM; I am not sure if that is normal. ...

Yes, it also leads to OOM in my case. It seems to me that it is not very easy to apply TP to HF models without introducing any third-party libraries. Any suggestions?
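
One possible workaround (a sketch, not part of this PR) is to let transformers shard the HF model across the visible GPUs with device_map="auto". This relies on accelerate under the hood and gives layer-wise sharding rather than true tensor parallelism, and the checkpoint name below is only illustrative:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# device_map="auto" (backed by accelerate) places layers across all visible GPUs,
# which can avoid single-GPU OOM. Note: this is layer/pipeline-style sharding,
# not true tensor parallelism.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # illustrative checkpoint, not fixed by this PR
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```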

mickqian (Contributor, Author) replied:

> I think we can update "How to Support a New vLM" in support_models.md. Each time we add a new VLM, we need to run this benchmark and compare the results with HF.

Updated.

yizhang2077 (Collaborator) replied:

> Yes, it also leads to OOM in my case. It seems to me that it is not very easy to apply TP to HF models without introducing any third-party libraries. Any suggestions?

I think it is caused by image_grid_thws being too large? One possible solution is to skip data samples with too large image_grid_thws, but I have not verified whether that works.
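
A minimal sketch of that kind of filtering, assuming a Hugging Face AutoProcessor whose output exposes image_grid_thw (as Qwen2-VL's processor does); the threshold, checkpoint, and function name below are hypothetical, not part of this PR:

```python
from transformers import AutoProcessor

# Hypothetical cap on the number of visual patches (t * h * w per image);
# the value is illustrative and would need tuning for the target GPU.
MAX_VISUAL_PATCHES = 16384

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def should_skip(prompt, image) -> bool:
    """Return True if the sample's image grid is likely to trigger an OOM."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    # image_grid_thw has shape (num_images, 3): temporal, height, width patch counts.
    grid = inputs["image_grid_thw"]
    num_patches = int((grid[:, 0] * grid[:, 1] * grid[:, 2]).sum())
    return num_patches > MAX_VISUAL_PATCHES
```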

yizhang2077 (Collaborator) commented:

After the HF OOM issue has been solved, this PR can be merged. @zhaochenyang20, can you take a look at the doc?

zhaochenyang20 (Collaborator) replied:

Sure, on this now.

zhaochenyang20 (Collaborator) left a comment:

LGTM

zhaochenyang20 (Collaborator) commented:

Will merge this today. @mickqian @yizhang2077

@zhaochenyang20 zhaochenyang20 merged commit 45205d8 into sgl-project:main Feb 22, 2025
18 of 19 checks passed