
Bridging the Reality Gap: A Benchmark for Physical Reasoning in General World Models with Various Physical Phenomena beyond Mechanics

[Figure: illustration of the Physics-RW benchmark]

Physics-RW is a benchmark for vision-language physical reasoning tasks, constructed from real-world videos. Covering a broad spectrum of real-world phenomena (mechanics, thermodynamics, electromagnetism, and optics), Physics-RW offers a comprehensive evaluation platform.


0. Contents

1. The Organized Structure of the Dataset
2. Downloading the Physics-RW dataset
3. Benchmark Evaluation
4. Baseline Models
5. Inference
6. Contact Us

1. The Organized Structure of the Dataset

The dataset is organized as follows.

```
# Data structure of the four folders, i.e., mechanics, thermodynamics,
# electromagnetism, and optics. They correspond to T1, T2, T3, and T4, respectively.
-- Mechanics (T1)
    -- classification
        -- video/                   # Folder storing the videos used for classification tasks.
        -- classification_en.json   # JSON file containing idx, video_path, the English version of the
                                    # instruction, the ground-truth label, and prediction. The prediction
                                    # field is empty and is intended to store the model's output.
        -- classification_zh.json   # Same as the above file, but with instructions in Chinese.
    -- video_generation
        -- video_gen/
            -- seen_video/          # Folder storing the videos input to the model.
            -- unseen_video/        # Folder storing the reference videos, i.e., the subsequent videos.
        -- video_gen_en.json        # JSON file containing idx, video_path, label_path (i.e., the path of
                                    # the subsequent video), the English version of the instruction,
                                    # and num_predicted_frame.
        -- video_gen_zh.json        # Same as the above file, but with instructions in Chinese.
-- Thermodynamics (T2)
    ...
-- Electromagnetism (T3)
    ...
-- Optics (T4)
    ...
```
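
As a concrete illustration, here is a minimal sketch of reading a classification split, running a model on each sample, and writing the answer back into the prediction field. The field names (idx, video_path, instruction, label, prediction) follow the description above; the top-level JSON layout (a flat list of records) and the `my_model_answer` stub are assumptions for illustration.

```python
# Minimal sketch: fill the "prediction" field of a classification JSON.
# Assumes the file is a flat list of records with the keys described above;
# my_model_answer is a hypothetical stand-in for an actual model call.
import json

def my_model_answer(video_path: str, instruction: str) -> str:
    """Hypothetical placeholder for a vision-language model call."""
    return ""

path = "Mechanics/classification/classification_en.json"
with open(path, "r", encoding="utf-8") as f:
    records = json.load(f)

for rec in records:
    rec["prediction"] = my_model_answer(rec["video_path"], rec["instruction"])

with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```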

2. Downloading the Physics-RW dataset

Our data is hosted on Hugging Face and ModelScope. Currently, only part of the data has been uploaded; once the review process is complete, we will upload the remaining data.
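
For the Hugging Face copy, a download sketch with the huggingface_hub client might look like the following; the repo_id below is a placeholder, since the README does not state the published dataset id.

```python
# Sketch: download the dataset snapshot with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="cocacola-lab/Physics-RW",  # hypothetical id -- check the hub page
    repo_type="dataset",
    local_dir="Physics-RW",
)
```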

3. Benchmark Evaluation

We primarily evaluate existing methods with accuracy (ACC), F1 score, and Fréchet Video Distance (FVD). Because the content files in video generation tasks are large, we provide the ground-truth subsequent videos so that generation can be evaluated locally. For classification tasks, however, we do not release the ground truth: store the model-generated content in the "prediction" field of the JSON files and submit the results following the dataset structure (excluding video files). We will evaluate submissions promptly and return the results. In the future, we plan to set up an evaluation website that showcases both the model results we have evaluated and results submitted by users.
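
For orientation, the classification metrics can be reproduced locally on any split where labels are available. The sketch below uses scikit-learn and assumes the label and prediction fields hold comparable class strings; it is not the official scorer.

```python
# Sketch: compute ACC / F1 from a filled-in classification JSON.
# Assumes "label" and "prediction" hold comparable class strings.
import json
from sklearn.metrics import accuracy_score, f1_score

with open("classification_en.json", encoding="utf-8") as f:
    records = json.load(f)

labels = [r["label"] for r in records]
preds = [r["prediction"] for r in records]
print("ACC:", accuracy_score(labels, preds))
print("F1 :", f1_score(labels, preds, average="macro"))
```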

4. Baseline Models

We have evaluated representative models; the code for each is available at the links below:

| Model Name | Paper or Project | Code Link | License |
| --- | --- | --- | --- |
| LLaMA-Adapter | LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | code | GPL-3.0 License |
| Large World Model | World Model on Million-Length Video and Language with Blockwise RingAttention | code | Apache License 2.0 |
| VideoChat | VideoChat: Chat-Centric Video Understanding | code | MIT License |
| VideoChat2 | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | code | MIT License |
| Video-LLaMA | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | code | BSD 3-Clause License |
| Video-ChatGPT | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | code | CC-BY-4.0 License |
| Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | code | Apache License 2.0 |
| MiniGPT4-Video | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | code | BSD 3-Clause License |
| Gemini 1.5 Pro | Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | code | Apache License 2.0 |
| GPT-4o | GPT-4o | code | MIT License |
| NExT-GPT | NExT-GPT: Any-to-Any Multimodal LLM | code | BSD 3-Clause License |
| Open-Sora | -------------- | code | Apache License 2.0 |

5. Inference

Here is an example of running inference with Gemini 1.5 Pro:

```bash
python gemini_inference.py
```
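
For reference, a standalone call equivalent in spirit to that script might look as follows with the google-generativeai SDK. This is a sketch under assumptions, not the actual contents of gemini_inference.py; the video path and prompt are placeholders, and GOOGLE_API_KEY must be set in the environment.

```python
# Sketch: ask Gemini 1.5 Pro a question about one benchmark video.
# NOT the contents of gemini_inference.py; paths and prompt are placeholders.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

video = genai.upload_file("Mechanics/classification/video/example.mp4")
while video.state.name == "PROCESSING":  # wait until the upload is ready
    time.sleep(5)
    video = genai.get_file(video.name)

response = model.generate_content([video, "Will the block slide down the ramp?"])
print(response.text)
```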

6. Contact Us

If you have any questions, please feel free to contact us via email at [email protected] or [email protected]. (Note: For classification task submissions, please send an email to the above addresses for now. We will set up a website for submissions in the future.)
