DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Welcome to the official repository for DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision-Language Models. This repository contains the code, resources, and documentation supporting our paper, which introduces DynaMath: a benchmark designed to rigorously evaluate mathematical reasoning across various vision-language models (VLMs).
For further details, including the benchmark leaderboard, please visit our project website and our preprint paper.
- 2024.11.25 DynaMath has been integrated into VLMEvalKit, enabling the one-command evaluation.
The rapid advancements in Vision-Language Models (VLMs) have shown significant potential in tackling mathematical reasoning tasks that involve visual context. However, unlike humans who can reliably apply solution steps to similar problems with minor modifications, state-of-the-art VLMs such as GPT-4o often fail to maintain consistency across such variations, revealing limitations in their mathematical reasoning capabilities.
DynaMATH addresses this challenge by providing a dynamic visual math benchmark specifically designed to evaluate the mathematical reasoning robustness of VLMs. While existing vision-based math benchmarks assess VLMs' problem-solving abilities with static problem sets, they lack the ability to evaluate performance robustness under varying problem conditions.
DynaMATH bridges this gap by introducing a benchmark with 501 high-quality, multi-topic seed questions, each represented as a Python program. These programs enable automatic generation of a much larger set of concrete questions with diverse visual and textual variations, providing a comprehensive testbed for evaluating generalization abilities of VLMs.
Figure: Illustration of the dynamic benchmark generation process in DynaMATH.
We assessed the performance of 14 state-of-the-art VLMs using 5,010 generated concrete questions (10 variations per seed question). The results reveal that the worst-case model accuracy, defined as the percentage of correctly answered seed questions across all variations, is significantly lower than the average-case accuracy. Moreover, the analysis shows that model errors on certain variants are not merely due to random chance, but are consistent failures that indicate underlying robustness issues. The findings highlight the need to study the robustness of VLMs' reasoning capabilities. DynaMATH offers valuable insights for developing more reliable models for mathematical reasoning tasks.
Our benchmark collection consists of two phases: Seed Question Collection and Program-based Question Generation.
-
Seed questions were selectively curated from existing visual math datasets and publicly available resources.
-
We collected:
- 107 questions from MathVista, covering topics like analytic geometry and statistics.
- 27 questions from MATH-V, focused on arithmetic, puzzles, and solid geometry.
- 45 questions based on scientific figures.
- 48 questions on graph theory from the MMMU dataset.
- 236 questions on advanced reasoning topics such as functions and geometry from publicly accessible resources.
- 38 newly developed questions covering linear algebra, set theory, and algorithmic flow.
-
After eliminating overly complex questions unsuitable for programmatic generation, the final dataset comprises 501 seed questions:
- 45.3% sourced from established visual math datasets.
- 54.7% newly collected or developed from public resources.
- Each seed question is transformed into a carefully designed Python program, enabling the generation of diverse concrete questions under randomly sampled conditions.
- 470 programs include a plotting function for dynamic visual contexts, while 31 programs use fixed images with randomized text elements. -This programmatic approach enables the creation of infinitely many concrete benchmark questions, facilitating the evaluation of VLMs' reasoning robustness.
Variant Types in DynaMath:
- Numerical Value Variants: Modifying numerical quantities to test arithmetic proficiency.
- Geometric Transformations: Changing shapes, angles, and dimensions to assess spatial understanding.
- Function Type Variants: Varying function types (e.g., linear, quadratic) to evaluate generalization.
- Color Variants: Altering colors to test robustness against superficial visual changes.
- Symbolic Substitutions: Modifying symbols to test adaptability to different mathematical representations.
- Graph Structure Variants: Changing graph layouts to evaluate comprehension of relationships.
- Real-life Context Variants: Modifying real-world scenarios (e.g., calendars, time-related problems) to test contextual understanding.
- Mathematical Topics: Covers nine topics including Solid Geometry (SG, 3.0%), Puzzle Tests (PT, 3.4%), Arithmetic (AR, 5.2%), Scientific Figures (SF, 9.0%), Graph Theory (GT, 9.6%), Algebra (AL, 10.2%), Plane Geometry (PG, 15.4%), Analytic Geometry (AG, 19.4%), and Statistics (ST, 25.0%).
- Difficulty Levels: Questions range from elementary to undergraduate level, with a focus on high school (55.3%) and undergraduate (32.1%) levels.
- Question Types: Includes 35.5% multiple-choice questions and 64.7% free-form questions.
This diverse collection of variants and topics makes DynaMath a comprehensive benchmark for evaluating the flexibility, robustness, and accuracy of VLMs in solving mathematical problems.
(New!) DynaMath has been integrated into VLMEvalKit, enabling the one-command evaluation.
Follow these steps to generate a version of the DynaMath dataset:
First, use the provided Dockerfile
to create a Docker image for the environment:
docker build -t dynamath .
Next, run the container interactively based on the created image, ensuring you mount the appropriate directories:
docker run -it -v /home/user/DynaMath:/app dynamath bash
Once inside the container, navigate to the dataset_generator
directory and generate the question variants by running:
cd dataset_generator
xvfb-run -s "-screen 0 640x480x24" python generate_json.py 1 501
This will generate a batch of questions starting from index 1 to 501.
Note: The
generate_json.py
script takes four arguments:
- Starting index
- Number of questions to generate from the starting index
- Default random seed
- Default NumPy seed
If you find DynaMath helpful for your research, please cite
@misc{zou2024dynamathdynamicvisualbenchmark,
title={DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models},
author={Chengke Zou and Xingang Guo and Rui Yang and Junyu Zhang and Bin Hu and Huan Zhang},
year={2024},
eprint={2411.00836},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.00836},
}