
bench: Add a benchmark for vLM: MMMU #3562
Merged: 2 commits merged into sgl-project:main on Feb 22, 2025

Conversation

mickqian (Contributor)

Motivation

Modifications

Checklist

mickqian mentioned this pull request on Feb 14, 2025
mickqian changed the title from "Add a new benchmark for vLM: MMMU" to "Add a benchmark for vLM: MMMU" on Feb 14, 2025
mickqian force-pushed the mmmu branch 2 times, most recently from e5f7884 to 21b3a5b on February 15, 2025 01:41
mickqian force-pushed the mmmu branch 2 times, most recently from 772ce3b to 67f81f1 on February 16, 2025 09:32
mickqian marked this pull request as ready for review on February 16, 2025 13:54
mickqian changed the title from "Add a benchmark for vLM: MMMU" to "bench: Add a benchmark for vLM: MMMU" on Feb 18, 2025
mickqian force-pushed the mmmu branch 3 times, most recently from b88293b to 6d44c37 on February 21, 2025 11:16
yizhang2077 (Collaborator) left a comment:

Hi @mickqian, I have left some comments; all of them are related to paths. I also tried to run bench_hf.py, but it seems to cause an OOM; I am not sure if that is normal.
Here are the qwen2vl and qwen2.5vl results in my environment:
qwen2vl
{"Overall-Art and Design": {"num": 120, "acc": 0.317}, "Art": {"num": 30, "acc": 0.4}, "Art_Theory": {"num": 30, "acc": 0.367}, "Design": {"num": 30, "acc": 0.3}, "Music": {"num": 30, "acc": 0.2}, "Overall-Business": {"num": 150, "acc": 0.32}, "Accounting": {"num": 30, "acc": 0.333}, "Economics": {"num": 30, "acc": 0.3}, "Finance": {"num": 30, "acc": 0.2}, "Manage": {"num": 30, "acc": 0.267}, "Marketing": {"num": 30, "acc": 0.5}, "Overall-Science": {"num": 150, "acc": 0.333}, "Biology": {"num": 30, "acc": 0.367}, "Chemistry": {"num": 30, "acc": 0.167}, "Geography": {"num": 30, "acc": 0.333}, "Math": {"num": 30, "acc": 0.433}, "Physics": {"num": 30, "acc": 0.367}, "Overall-Health and Medicine": {"num": 150, "acc": 0.38}, "Basic_Medical_Science": {"num": 30, "acc": 0.433}, "Clinical_Medicine": {"num": 30, "acc": 0.433}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.133}, "Pharmacy": {"num": 30, "acc": 0.567}, "Public_Health": {"num": 30, "acc": 0.333}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.35}, "History": {"num": 30, "acc": 0.367}, "Literature": {"num": 30, "acc": 0.367}, "Sociology": {"num": 30, "acc": 0.267}, "Psychology": {"num": 30, "acc": 0.4}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.267}, "Agriculture": {"num": 30, "acc": 0.233}, "Architecture_and_Engineering": {"num": 30, "acc": 0.3}, "Computer_Science": {"num": 30, "acc": 0.333}, "Electronics": {"num": 30, "acc": 0.167}, "Energy_and_Power": {"num": 30, "acc": 0.267}, "Materials": {"num": 30, "acc": 0.367}, "Mechanical_Engineering": {"num": 30, "acc": 0.2}, "Overall": {"num": 900, "acc": 0.323}}
qwen2.5vl
{"Overall-Art and Design": {"num": 120, "acc": 0.242}, "Art": {"num": 30, "acc": 0.2}, "Art_Theory": {"num": 30, "acc": 0.267}, "Design": {"num": 30, "acc": 0.3}, "Music": {"num": 30, "acc": 0.2}, "Overall-Business": {"num": 150, "acc": 0.3}, "Accounting": {"num": 30, "acc": 0.467}, "Economics": {"num": 30, "acc": 0.333}, "Finance": {"num": 30, "acc": 0.1}, "Manage": {"num": 30, "acc": 0.233}, "Marketing": {"num": 30, "acc": 0.367}, "Overall-Science": {"num": 150, "acc": 0.2}, "Biology": {"num": 30, "acc": 0.133}, "Chemistry": {"num": 30, "acc": 0.133}, "Geography": {"num": 30, "acc": 0.2}, "Math": {"num": 30, "acc": 0.3}, "Physics": {"num": 30, "acc": 0.233}, "Overall-Health and Medicine": {"num": 150, "acc": 0.267}, "Basic_Medical_Science": {"num": 30, "acc": 0.233}, "Clinical_Medicine": {"num": 30, "acc": 0.167}, "Diagnostics_and_Laboratory_Medicine": {"num": 30, "acc": 0.2}, "Pharmacy": {"num": 30, "acc": 0.367}, "Public_Health": {"num": 30, "acc": 0.367}, "Overall-Humanities and Social Science": {"num": 120, "acc": 0.242}, "History": {"num": 30, "acc": 0.167}, "Literature": {"num": 30, "acc": 0.333}, "Sociology": {"num": 30, "acc": 0.2}, "Psychology": {"num": 30, "acc": 0.267}, "Overall-Tech and Engineering": {"num": 210, "acc": 0.276}, "Agriculture": {"num": 30, "acc": 0.2}, "Architecture_and_Engineering": {"num": 30, "acc": 0.167}, "Computer_Science": {"num": 30, "acc": 0.233}, "Electronics": {"num": 30, "acc": 0.267}, "Energy_and_Power": {"num": 30, "acc": 0.367}, "Materials": {"num": 30, "acc": 0.433}, "Mechanical_Engineering": {"num": 30, "acc": 0.267}, "Overall": {"num": 900, "acc": 0.257}}

yizhang2077 (Collaborator) left a comment:

I think we can update "How to Support a New vLM" in support_models.md. Each time we add a new VLM, we need to run this benchmark and compare the results with HF.

mickqian (Contributor, Author) replied:

> Hi @mickqian, I have left some comments; all of them are related to paths. I also tried to run bench_hf.py, but it seems to cause an OOM; I am not sure if that is normal. ...

Yes, it also leads to OOM in my case. It seems to me that it is not very easy to apply TP to HF models without introducing any third-party libraries. Any suggestions?
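
One possible workaround (a sketch, not part of this PR) is to let transformers shard the HF model across the visible GPUs with device_map="auto". This relies on accelerate under the hood and gives layer-wise sharding rather than true tensor parallelism, and the checkpoint name below is only illustrative:

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# device_map="auto" (backed by accelerate) places layers across all visible GPUs,
# which can avoid single-GPU OOM. Note: this is layer/pipeline-style sharding,
# not true tensor parallelism.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # illustrative checkpoint, not fixed by this PR
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```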

mickqian (Contributor, Author) replied:

> I think we can update "How to Support a New vLM" in support_models.md. Each time we add a new VLM, we need to run this benchmark and compare the results with HF.

Updated.

yizhang2077 (Collaborator) replied:

> Yes, it also leads to OOM in my case. It seems to me that it is not very easy to apply TP to HF models without introducing any third-party libraries. Any suggestions?

I think it is caused by image_grid_thws being too large? One possible solution is to skip data samples with too large image_grid_thws, but I have not verified whether that works.
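
A minimal sketch of that kind of filtering, assuming a Hugging Face AutoProcessor whose output exposes image_grid_thw (as Qwen2-VL's processor does); the threshold, checkpoint, and function name below are hypothetical, not part of this PR:

```python
from transformers import AutoProcessor

# Hypothetical cap on the number of visual patches (t * h * w per image);
# the value is illustrative and would need tuning for the target GPU.
MAX_VISUAL_PATCHES = 16384

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def should_skip(prompt, image) -> bool:
    """Return True if the sample's image grid is likely to trigger an OOM."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt")
    # image_grid_thw has shape (num_images, 3): temporal, height, width patch counts.
    grid = inputs["image_grid_thw"]
    num_patches = int((grid[:, 0] * grid[:, 1] * grid[:, 2]).sum())
    return num_patches > MAX_VISUAL_PATCHES
```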

yizhang2077 (Collaborator) commented:

After the HF OOM issue has been solved, this PR can be merged. @zhaochenyang20, can you take a look at the doc?

zhaochenyang20 (Collaborator) replied:

Sure, on this now.

zhaochenyang20 (Collaborator) left a comment:

LGTM

zhaochenyang20 (Collaborator) commented:

Will merge this today. @mickqian @yizhang2077

@zhaochenyang20 zhaochenyang20 merged commit 45205d8 into sgl-project:main Feb 22, 2025
18 of 19 checks passed