
Bridging the Reality Gap: A Benchmark for Physical Reasoning in General World Models with Various Physical Phenomena beyond Mechanics

[Figure: illustration of the Physics-RW benchmark]

Physics-RW is a benchmark for vision-language physical reasoning tasks, constructed from real-world videos. Covering a broad spectrum of real-world phenomena (mechanics, thermodynamics, electromagnetism, and optics), Physics-RW offers a comprehensive evaluation platform.


0. Contents

1. The Organized Structure of the Dataset
2. Downloading the Physics-RW dataset
3. Benchmark Evaluation
4. Baseline Models
5. Inference
6. Contact Us

1. The Organized Structure of the Dataset

The dataset is organized as follows.

```
# Data structure of the four folders, i.e., mechanics, thermodynamics,
# electromagnetism, and optics. They correspond to T1, T2, T3, and T4, respectively.
-- Mechanics (T1)
    -- classification
        -- video/                   # Folder storing the videos used for classification tasks.
        -- classification_en.json   # JSON file containing idx, video_path, the English version of the
                                    # instruction, the ground-truth label, and prediction. The prediction
                                    # field is empty and is intended to store the model's output.
        -- classification_zh.json   # Same as the above file, but with instructions in Chinese.
    -- video_generation
        -- video_gen/
            -- seen_video/          # Folder storing the videos input to the model.
            -- unseen_video/        # Folder storing the reference videos, i.e., the subsequent videos.
        -- video_gen_en.json        # JSON file containing idx, video_path, label_path (i.e., the path of
                                    # the subsequent video), the English version of the instruction,
                                    # and num_predicted_frame.
        -- video_gen_zh.json        # Same as the above file, but with instructions in Chinese.
-- Thermodynamics (T2)
    ...
-- Electromagnetism (T3)
    ...
-- Optics (T4)
    ...
```
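
As a concrete illustration, here is a minimal sketch of reading a classification split, running a model on each sample, and writing the answer back into the prediction field. The field names (idx, video_path, instruction, label, prediction) follow the description above; the top-level JSON layout (a flat list of records) and the `my_model_answer` stub are assumptions for illustration.

```python
# Minimal sketch: fill the "prediction" field of a classification JSON.
# Assumes the file is a flat list of records with the keys described above;
# my_model_answer is a hypothetical stand-in for an actual model call.
import json

def my_model_answer(video_path: str, instruction: str) -> str:
    """Hypothetical placeholder for a vision-language model call."""
    return ""

path = "Mechanics/classification/classification_en.json"
with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    rec["prediction"] = my_model_answer(rec["video_path"], rec["instruction"])

with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```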

2. Downloading the Physics-RW dataset

Our data is hosted on Hugging Face and ModelScope. Currently, only part of the data has been uploaded; once the review process is complete, we will upload the remaining data.
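
For the Hugging Face copy, a download sketch with the huggingface_hub client might look like the following; the repo_id below is a placeholder, since the README does not state the published dataset id.

```python
# Sketch: download the dataset snapshot with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cocacola-lab/Physics-RW",  # hypothetical id -- check the hub page
    repo_type="dataset",
    local_dir="Physics-RW",
)
```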

3. Benchmark Evaluation

We primarily evaluate existing methods with accuracy (ACC), F1 score, and Fréchet Video Distance (FVD). Because the content files in video generation tasks are large, we provide the ground-truth subsequent videos so that generation can be evaluated locally. For classification tasks, however, we do not release the ground truth: store the model-generated content in the "prediction" field of the JSON files and submit the results following the dataset structure (excluding video files). We will evaluate submissions promptly and return the results. In the future, we plan to set up an evaluation website that showcases both the model results we have evaluated and results submitted by users.
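
For orientation, the classification metrics can be reproduced locally on any split where labels are available. The sketch below uses scikit-learn and assumes the label and prediction fields hold comparable class strings; it is not the official scorer.

```python
# Sketch: compute ACC / F1 from a filled-in classification JSON.
# Assumes "label" and "prediction" hold comparable class strings.
import json
from sklearn.metrics import accuracy_score, f1_score

with open("classification_en.json", encoding="utf-8") as f:
    records = json.load(f)

labels = [r["label"] for r in records]
preds = [r["prediction"] for r in records]
print("ACC:", accuracy_score(labels, preds))
print("F1 :", f1_score(labels, preds, average="macro"))
```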

4. Baseline Models

We have evaluated representative models; the code for each is available at the links below:

| Model Name | Paper or Project | Code Link | License |
| --- | --- | --- | --- |
| LLaMA-Adapter | LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | code | GPL-3.0 License |
| Large World Model | World Model on Million-Length Video and Language with Blockwise RingAttention | code | Apache License 2.0 |
| VideoChat | VideoChat: Chat-Centric Video Understanding | code | MIT License |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | code | MIT License |
| Video-LLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | code | BSD 3-Clause License |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | code | CC-BY-4.0 License |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | code | Apache License 2.0 |
| MiniGPT4-Video | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | code | BSD 3-Clause License |
| Gemini 1.5 Pro | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | code | Apache License 2.0 |
| GPT-4o | GPT-4o | code | MIT License |
| NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | code | BSD 3-Clause License |
| Open-Sora | -------------- | code | Apache License 2.0 |

5. Inference

Here is an example of running inference with Gemini 1.5 Pro:

```bash
python gemini_inference.py
```
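
For reference, a standalone call equivalent in spirit to that script might look as follows with the google-generativeai SDK. This is a sketch under assumptions, not the actual contents of gemini_inference.py; the video path and prompt are placeholders, and GOOGLE_API_KEY must be set in the environment.

```python
# Sketch: ask Gemini 1.5 Pro a question about one benchmark video.
# NOT the contents of gemini_inference.py; paths and prompt are placeholders.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

video = genai.upload_file("Mechanics/classification/video/example.mp4")
while video.state.name == "PROCESSING":  # wait until the upload is ready
    time.sleep(5)
    video = genai.get_file(video.name)

response = model.generate_content([video, "Will the block slide down the ramp?"])
print(response.text)
```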

6. Contact Us

If you have any questions, please feel free to contact us via email at [email protected] or [email protected]. (Note: For classification task submissions, please send an email to the above addresses for now. We will set up a website for submissions in the future.)
