
ChaChaBench

A Diagnostic Benchmark for Camera Motion Understanding in VLMs

Understanding the observer’s motion is as essential as perceiving the scene itself. Despite advances in video-language models (VLMs), their ability to recognize the camera’s movement—independent of scene content—remains largely untested.

We evaluate whether VLMs can accurately identify basic egocentric camera motions, such as move forward, pan left, or tilt up, from short, single-motion videos. To isolate this core ability, we generate 1,000+ minimalistic, noise-free clips in OmniGibson, avoiding any object motion, lighting artifacts, or compositional complexity.

Most VLMs are pre-trained on scene-centric captions, yet reasoning about space, perspective, and viewer motion is foundational for tasks like navigation, tracking, and embodiment. Our benchmark directly probes this capacity.

Models Evaluated

  • Gemini-2.0 & Gemini-2.5 (Flash)
  • Qwen2.5-VL (7B / 32B / 72B, Instruct)
  • NVILA-15B & LongVILA
  • CameraBench (SFT)

Dataset Format

  • 12-class taxonomy: move, pan, tilt, and roll across the 6DoF directions
  • Videos + JSON annotations (a minimal loading sketch is shown below)
  • Compatible with any vision-language pipeline
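
A minimal sketch of loading the annotations, assuming each clip is described by a JSON record with a video-path field and a motion-label field; the field names below are hypothetical, so check the downloaded files for the actual schema.

import json
from pathlib import Path

def load_annotations(root="data"):
    # Collect every JSON annotation file under the data directory.
    records = []
    for ann_file in sorted(Path(root).glob("**/*.json")):
        with open(ann_file) as f:
            loaded = json.load(f)
        records.extend(loaded if isinstance(loaded, list) else [loaded])
    return records

for record in load_annotations():
    video_path = record["video"]  # hypothetical field name: path to the clip
    label = record["label"]       # hypothetical field name: one of the 12 motion classes
    # Pass (video_path, label) into any vision-language pipeline.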

Requirements

To install requirements:

pip install -r requirements.txt

Then download the dataset from Hugging Face into data/ (see the sketch below).
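
A minimal sketch of downloading the dataset with huggingface_hub, assuming it is hosted as a dataset repository on the Hugging Face Hub; the repo id below is a placeholder, so substitute the actual ChaChaBench dataset id.

from huggingface_hub import snapshot_download

# Download the full dataset snapshot into data/ (the repo id is a placeholder).
snapshot_download(
    repo_id="<hf_dataset_repo_id>",
    repo_type="dataset",
    local_dir="data",
)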

Evaluation

To evaluate on ChaChaBench, run:

python main.py --model_path <model_path>
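
The Results section reports an average F1 per model. Below is a minimal sketch of how per-class and macro-averaged F1 can be computed from predictions with scikit-learn, assuming parallel lists of ground-truth and predicted motion labels; this is illustrative only and not necessarily the scoring logic used by main.py.

from sklearn.metrics import f1_score

# y_true / y_pred: one motion-class label per clip.
y_true = ["pan left", "move forward", "tilt up"]
y_pred = ["pan left", "move forward", "pan left"]

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 score per class
macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes
print(per_class_f1, macro_f1)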

Data Curation

Pre-Requisite

Please follow the OmniGibson installation documentation. You will need a GPU more powerful than an RTX 2070. In omni-control/__init__.py, update the macros gm.ASSET_PATH, gm.DATASET_PATH, and gm.KEY_PATH to match your installation directories, as in the sketch below.
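
A minimal sketch of these macro overrides, assuming OmniGibson exposes its global macros as omnigibson.macros.gm; the paths are placeholders for your local installation.

from omnigibson.macros import gm

# Point OmniGibson at your local asset, dataset, and key locations (placeholder paths).
gm.ASSET_PATH = "/path/to/omnigibson/assets"
gm.DATASET_PATH = "/path/to/behavior-1k/dataset"
gm.KEY_PATH = "/path/to/omnigibson.key"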

After installation, to create the dataset with OmniGibson on the BEHAVIOR-1K dataset, run:

python -m omni-control.generate_single_command_data

You can change the output directory through omni-control/config.py and the hard-coded variables set inside omni-control/generate_single_command_data, for example as sketched below.
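
A hypothetical example of such an override in omni-control/config.py; the variable name below is a placeholder, so check the file for the real one.

# omni-control/config.py (hypothetical variable name)
OUTPUT_DIR = "/path/to/generated/clips"  # where the generated single-motion clips are written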

Results

Model                      Avg F1 (%)
Human (reference)          97.2
Gemini-2.5-Flash (best)    35.6
Qwen2.5-VL-72B-Instruct    25.8
  • All models severely underperform on move backward (F1 ≈ 0–7%)
  • Many models conflate rotation (e.g., pan/tilt) with translation (e.g., move left/right)
  • Models are biased toward predicting forward motion, yet precision on the forward class is low, indicating poor class discrimination
  • Roll confuses even the larger models
