Understanding the observer’s motion is as essential as perceiving the scene itself. Despite advances in video-language models (VLMs), their ability to recognize the camera’s movement—independent of scene content—remains largely untested.
We evaluate whether VLMs can accurately identify basic egocentric camera motions, such as move forward, pan left, or tilt up, from short, single-motion videos. To isolate this core ability, we generate 1,000+ minimalistic, noise-free clips in OmniGibson, avoiding any object motion, lighting artifacts, or compositional complexity.
Most VLMs are pre-trained on scene-centric captions. But reasoning about space, perspective, and viewer motion is foundational for tasks like navigation, tracking, and embodiment. Our benchmark directly probes this capacity.
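Each clip varies along exactly one motion axis. As an illustration (this is our own sketch, not the benchmark's generation code), a pan-left clip can be parameterized as a fixed camera position with pure yaw increments:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def pan_left_trajectory(start_pos, num_frames=30, deg_per_frame=1.0):
    """Build (position, quaternion) camera poses for a pure pan-left clip.

    The position stays fixed and only the yaw (rotation about the world
    up axis) changes, so the clip contains a single motion primitive.
    The sign convention (pan left = positive yaw) is an assumption.
    """
    poses = []
    for t in range(num_frames):
        yaw_deg = t * deg_per_frame
        quat = R.from_euler("z", yaw_deg, degrees=True).as_quat()  # (x, y, z, w)
        poses.append((np.asarray(start_pos, dtype=float), quat))
    return poses

# A 30-frame pan-left clip from a fixed viewpoint 1.5 m above the floor.
poses = pan_left_trajectory(start_pos=[0.0, 0.0, 1.5])
```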
We evaluate the following models:

- Gemini-2.0 & Gemini-2.5 (Flash)
- Qwen2.5-VL (7B / 32B / 72B, Instruct)
- NVILA-15B & LongVILA
- CameraBench (SFT)
The benchmark provides:

- A 12-class taxonomy: move (six directions) plus pan, tilt, and roll (two directions each)
- Videos + JSON annotations (a loading sketch follows this list)
- Compatible with any vision-language pipeline
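The released annotation schema is not reproduced here; the sketch below assumes one JSON record per clip with a `video` path and a `label` field drawn from the 12-class taxonomy. Adjust the keys to match the actual files.

```python
import json
from pathlib import Path

DATA_DIR = Path("data")  # location of the HuggingFace download

def load_annotations(data_dir=DATA_DIR):
    """Collect (video_path, label) pairs from per-clip JSON annotations.

    The field names "video" and "label" are assumptions; check the
    released annotation files for the real schema.
    """
    samples = []
    for ann_file in sorted(data_dir.glob("**/*.json")):
        record = json.loads(ann_file.read_text())
        samples.append((data_dir / record["video"], record["label"]))
    return samples
```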
To install the requirements, run:

`pip install -r requirements.txt`
Download the data from HuggingFace into `data/`.
To evaluate on ChaChaBench, run:

`python main.py --model_path <model_path>`
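Scores are reported as average F1 over the 12 classes. Assuming the evaluation harness produces parallel lists of gold and predicted labels, and that the reported average is a macro F1, the metric can be computed with scikit-learn:

```python
from sklearn.metrics import f1_score

# One 12-class label per clip. The label strings here are illustrative;
# use whatever names the annotations define.
y_true = ["pan_left", "move_forward", "tilt_up"]
y_pred = ["move_left", "move_forward", "tilt_up"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Avg F1: {100 * macro_f1:.1f}%")
```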
Please follow the OmniGibson installation documentation. You will need a GPU more powerful than an RTX 2070 to run it.
In `omni-control/__init__.py`, update the macros `gm.ASSET_PATH`, `gm.DATASET_PATH`, and `gm.KEY_PATH` to match your installation directories.
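A minimal sketch of what those edits can look like, assuming the standard `omnigibson.macros` import (the paths are placeholders for your own directories):

```python
# omni-control/__init__.py (excerpt): point OmniGibson's global macros
# at your local installation. All paths below are placeholders.
from omnigibson.macros import gm

gm.ASSET_PATH = "/path/to/omnigibson/assets"
gm.DATASET_PATH = "/path/to/behavior-1k/datasets"
gm.KEY_PATH = "/path/to/omnigibson.key"
```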
After installation, to create the dataset with OmniGibson on the BEHAVIOR-1K dataset, run:

`python -m omni-control.generate_single_command_data`
You can change the output directory through `omni-control/config.py`, as well as the hard-coded variables set inside `omni-control/generate_single_command_data`.
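Conceptually, the output-directory override is a single assignment; the variable name below is hypothetical, so inspect `omni-control/config.py` for the actual setting:

```python
# omni-control/config.py (sketch): the name OUTPUT_DIR is an assumption;
# check the file for the real output-directory variable.
OUTPUT_DIR = "/path/to/generated/clips"
```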
Results:

| Model | Avg F1 (%) |
|---|---|
| Human (reference) | 97.2 |
| Gemini-2.5-Flash (best) | 35.6 |
| Qwen2.5-VL-72B-Instruct | 25.8 |
Key findings:

- All models severely underperform on move backward (F1 ≈ 0–7%)
- Many models conflate rotation (e.g., pan/tilt) with translation (e.g., move left/right)
- Models are biased toward predicting forward motion, with low precision on the forward class, indicating poor class discrimination
- Roll confuses even the larger models (a confusion-matrix sketch for this kind of error analysis follows this list)
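These error patterns are easiest to see in a per-class confusion matrix; a minimal sketch with scikit-learn, using illustrative label names:

```python
from sklearn.metrics import confusion_matrix

# Illustrative subset of the 12 classes; use the benchmark's actual names.
labels = ["move_forward", "move_backward", "move_left", "pan_left"]
y_true = ["pan_left", "move_backward", "move_forward", "move_left"]
y_pred = ["move_left", "move_forward", "move_forward", "move_left"]

# Rows are gold classes, columns are predictions. Off-diagonal mass between
# pan_* and move_* entries exposes rotation/translation conflation.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```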