A shortcut-aware video question answering (VideoQA) benchmark for spatio-temporal and intuitive physics understanding, built from minimally different video pairs.
To enable reproducible evaluation, we use the lmms-eval library, which is referenced as a git submodule. Clone this repo with the `--recurse-submodules` flag, which will automatically set up the required submodules. Alternatively, run `git submodule update --init` manually after cloning.
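For example (the repository URL below is a placeholder; use the URL of this repository):

```bash
# Clone the repo together with the lmms-eval submodule
git clone --recurse-submodules <URL-of-this-repository>

# Or, if you already cloned without the flag, fetch the submodule afterwards
git submodule update --init
```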
Next, navigate into the root directory of the repository and create your conda environment:

```bash
make .env.init
```
The annotations are released at `facebook/minimal_video_pairs` on Hugging Face Datasets. This repository's Makefile provides scripts for downloading the videos; make sure to accept the data license requirements for each data source before attempting to download. Next, log into Hugging Face (this is needed to download the Vinoground subset):

```bash
huggingface-cli login
```
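If you want to inspect the annotation files locally, a download along these lines should work (the `--local-dir` path is just an example):

```bash
# Optional: pull the annotation files from Hugging Face for local inspection
huggingface-cli download facebook/minimal_video_pairs \
  --repo-type dataset \
  --local-dir ./annotations
```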
Now you can download all videos from their original data sources:

```bash
make download_videos
```

This will create a `videos` folder with 9 subfolders, one per data source, which are used to build the subsets:
| Subset | Data sources |
|---|---|
| Human object interactions | PerceptionTest, SomethingSomethingV2 |
| Robot object interactions | Language Table |
| Intuitive physics and collisions | IntPhys, InfLevel, GRASP, CLEVRER |
| Temporal reasoning | STAR, Vinoground |
As previously mentioned, we use the lmms-eval library for reproducible evaluation. We provide the task files for two tasks, `mvp` and `mvp_mini`; `mvp_mini` is a smaller, balanced evaluation set of 9k examples for faster evaluation.
To run the evals:

- Copy the task files from the `tasks/mvp` folder to `lmms-eval/lmms_eval/tasks/`.
- Ensure the videos are downloaded into the `videos` folder in the root of this repository.
- Run the evaluations with the task names `mvp` and `mvp_mini`. You can also run individual subsets. An example command is sketched below.
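As a rough sketch, an lmms-eval run could look like the following; the model backend, checkpoint, and flags are illustrative and depend on your setup and lmms-eval version:

```bash
# Illustrative only: evaluate a LLaVA checkpoint on the mvp_mini task
accelerate launch --num_processes=1 -m lmms_eval \
  --model llava \
  --model_args pretrained=liuhaotian/llava-v1.5-7b \
  --tasks mvp_mini \
  --batch_size 1 \
  --log_samples \
  --output_path ./logs/
```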
We primarily report `paired_accuracy`. An example in `mvp` consists of two QA instances with identical questions and answer options (A or B), but the videos differ and the correct option differs (A is correct for video1 and B for video2). For `paired_accuracy`, a model only scores a point (+1) on a pair if it answers both questions correctly.
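As a minimal sketch, paired accuracy can be computed from a per-question results file with one record per question; the `pair_id` and `correct` fields below are illustrative placeholders for however your logs identify the pair and record correctness, not the exact lmms-eval log schema:

```bash
# Fraction of pairs where BOTH questions are answered correctly
# (assumes JSONL records with a pair identifier and a boolean correctness flag)
jq -s 'group_by(.pair_id)
       | (map(select(all(.[]; .correct))) | length) / length' results.jsonl
```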
We have set up a leaderboard as part of the Physical World Models release from FAIR on Hugging Face: the Physical Reasoning Leaderboard. To submit your model's results to the leaderboard, combine the `mvp_[mini]_{task}.jsonl` files in the `./logs/{model}` folder and upload them along with the specifics of your run:

```bash
cat submissions/mvp_*.jsonl > mvp_submission.jsonl
```
We are grateful for the many open-source datasets on top of which we built our benchmark: Perception Test, Something Something v2, CLEVRER, Language Table, IntPhys, InfLevel, GRASP, STAR, Vinoground.
If you find this repository useful in your research, please consider giving it a star ⭐ and a citation, and make sure to cite the original video data sources referenced above as well:

```bibtex
@article{krojer2025shortcut,
  title={A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs},
  author={Benno Krojer and Mojtaba Komeili and Candace Ross and Quentin Garrido and Koustuv Sinha and Nicolas Ballas and Mahmoud Assran},
  journal={arXiv},
  year={2025}
}
```
This benchmark is released under the license found in the LICENSE file in the root directory of this source tree.