Physics-RW is a dataset for vision-language physical reasoning tasks, constructed from real-world videos. It covers a broad spectrum of real-world phenomena (mechanics, thermodynamics, electromagnetism, and optics) and offers a comprehensive evaluation platform.
- 0. Contents
- 1. The Organized Structure of Dataset
- 2. Download Dataset
- 3. Benchmark Evaluation
- 4. Baseline Models
- 5. Inference
- 6. Contact Us
The dataset is organized as follows.
# The data structure in the four folders, i.e., mechanics, thermodynamics, electromagnetism, and optics. They correspond to T1, T2, T3, and T4 respectively.
-- Mechanics (T1)
    -- classification
        -- video/                     # The folder for storing videos used for classification tasks.
        -- classification_en.json     # The JSON file contains idx, video_path, the English version of the instruction, ground-truth label and prediction.
                                      # The prediction value is empty, intended to store the model's output.
        -- classification_zh.json     # Similar to the above file, but the instructions are in Chinese.
    -- video_generation
        -- video_gen/
            -- seen_video/            # The folder for storing videos input to the model.
            -- unseen_video/          # The folder for storing reference videos, i.e., subsequent videos.
        -- video_gen_en.json          # The JSON file contains idx, video_path, label_path (i.e., the path of subsequent video),
                                      # the English version of the instruction, and num_predicted_frame.
        -- video_gen_zh.json          # Similar to the above file, but the instructions are in Chinese.
-- Thermodynamics (T2)
    ...
-- Electromagnetism (T3)
    ...
-- Optics (T4)
    ...
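As a rough illustration of how the annotation files described above might be read, here is a minimal sketch. It assumes each JSON file holds a top-level list of records and that `video_path` is relative to the task folder; neither detail is confirmed here, and the helper names `load_annotations` and `resolve_video` are hypothetical.

```python
import json
from pathlib import Path


def load_annotations(json_path):
    """Load one annotation file and return its records.

    Assumption: the file is a JSON list of dicts, one dict per sample.
    """
    with open(json_path, "r", encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(
            f"expected a JSON list in {json_path}, got {type(records).__name__}"
        )
    return records


def resolve_video(record, task_dir):
    """Join the record's video_path onto the task folder.

    Assumption: video_path is stored relative to the task folder.
    """
    return Path(task_dir) / record["video_path"]
```

The same loader should work for both the `*_en.json` and `*_zh.json` files, since they are described as differing only in the language of the instruction.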
Our data is hosted on Hugging Face and ModelScope. Currently, only part of the data has been uploaded. Once the review process is complete, we will upload the full dataset.
We primarily evaluate existing methods using accuracy (ACC), F1 score, and Fréchet Video Distance (FVD). Because the content files for video generation tasks are large, we provide the subsequent videos for evaluation. For classification tasks, however, we do not provide ground truth. Users should store the model output in the "prediction" field of the JSON files and then submit the results following the dataset structure (excluding video files). We will run the evaluation promptly and return the results. In the future, we plan to set up an evaluation website that shows both the results of the models we evaluated and the results submitted by users.
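The submission step and the classification metrics can be sketched as follows. This is only an illustration under stated assumptions: `predict_fn` stands in for whatever model is used, the helper names are hypothetical, and macro averaging is assumed for F1 because the averaging scheme is not specified here. Since ground truth for classification is not released, the metric helper is useful only as a local sanity check on samples for which labels are available.

```python
import json
from collections import Counter


def fill_predictions(records, predict_fn):
    """Return copies of the records with "prediction" set by predict_fn.

    predict_fn is a placeholder for the user's model: it takes one record
    and returns the predicted label.
    """
    filled = []
    for record in records:
        updated = dict(record)  # shallow copy, the input list is not modified
        updated["prediction"] = predict_fn(record)
        filled.append(updated)
    return filled


def save_submission(records, out_path):
    """Write the filled records as UTF-8 JSON (keeps Chinese text readable)."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


def accuracy_and_macro_f1(labels, preds):
    """Compute accuracy and macro-averaged F1 (averaging is an assumption)."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")

    tp, fp, fn = Counter(), Counter(), Counter()
    for label, pred in zip(labels, preds):
        if label == pred:
            tp[label] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[label] += 1  # true class gains a false negative

    accuracy = sum(tp.values()) / len(labels)

    f1_scores = []
    for cls in set(labels) | set(preds):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        f1_scores.append(2 * tp[cls] / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)
```

A typical flow would be to load a classification JSON file, call `fill_predictions` with a function that wraps the model, and write the result with `save_submission` to a folder that mirrors the dataset structure without the video files.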
We have evaluated several representative models, and the code is available at the following link:
Here is an example of running inference with Gemini 1.5 Pro:
python gemini_inference.py
If you have any questions, please feel free to contact us via email at [email protected] or [email protected]. (Note: For classification task submissions, please send an email to the above addresses for now. We will set up a website for submissions in the future.)