- **Note:** Our dataset is built from the following source datasets:
  - Video-ChatGPT Video Instruction Dataset
    - ActivityNet and WebVid videos
    - 100K instructions
  - Video Localized Narratives Dataset
  - How2QA
  - NextQA
  - WebVid
- 📜 Instructions: Download all of our video instructions from 🤗 SNUMPR/vlm_rlaif_datasets
  | Dataset Usage | Filename | Source of Videos |
  |---|---|---|
  | SFT (short) | SFT_short.json | All |
  | SFT (long) | SFT_long.json | All |
  | Preference dataset (RM) | RM_13b_v1_dataset_39k.json | ANet |
  | PPO init | PPO_init.json | ANet |
  | RLAIF | RL_data.json | ANet |
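Once downloaded, the instruction files above can be resolved programmatically. This is a minimal sketch of our own (not part of the released codebase), assuming the JSON files sit under `TRAIN_DATA_ROOT/instructions` as in the folder structure below; the stage keys are hypothetical names we chose here:

```python
from pathlib import Path

# Maps a training stage (left column of the table above) to its
# instruction file. Stage keys are our own naming, not the repo's.
INSTRUCTION_FILES = {
    "sft_short": "SFT_short.json",
    "sft_long": "SFT_long.json",
    "preference_rm": "RM_13b_v1_dataset_39k.json",
    "ppo_init": "PPO_init.json",
    "rlaif": "RL_data.json",
}

def instruction_path(train_data_root: str, stage: str) -> Path:
    """Resolve an instruction file under TRAIN_DATA_ROOT/instructions."""
    return Path(train_data_root) / "instructions" / INSTRUCTION_FILES[stage]

print(instruction_path("playground/data/train_dataset", "rlaif"))
```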
- 🎥 Videos: Download the source videos following the instructions below, then extract 50 frames per video to train the model.
- Video-ChatGPT Instruction Dataset (ActivityNet videos):
  - Frames (🤗 SNUMPR/vlm_rlaif_train_anet_frames): our preprocessed version, with 50 frames extracted per video
  - Videos: mp4 files from the original paper
- Video Localized Narratives Dataset
  - See the download instructions for the original dataset.
  - Download the source videos from four datasets: OoPs, OVIS, kinetics400 (UVO), and kinetics.
  - Extract 50 frames per video into `OOPs_50frames`, `OVIS_50frames`, `kinetics400_50frames`, and `kinetics_50frames`.
- How2QA
  - See the download instructions to download the videos.
  - Extract 50 frames per video into `how2qa_50frames`.
- NeXTQA
  - Download the video files from the Google Drive link provided by the original authors.
  - Extract 50 frames per video into `nextqa_50frames`.
- WebVid
  - Follow the official WebVid dataset README to download the videos.
  - Extract 50 frames per video into `webvid_50frames`.
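Each block above repeats the same preprocessing step: sample 50 frames per video. The README does not specify the sampling scheme, so uniform spacing is an assumption here; this sketch only picks the frame indices, which you would then feed to OpenCV's `VideoCapture` or an ffmpeg `select` filter:

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 50) -> list[int]:
    """Return `num_frames` evenly spaced frame indices in [0, total_frames).

    Uniform spacing is our assumption; the authors' exact sampling
    scheme is not stated in this README.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if total_frames <= num_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(total_frames))
    step = (total_frames - 1) / (num_frames - 1)  # include first and last frame
    return [round(i * step) for i in range(num_frames)]

print(uniform_frame_indices(300)[:5])
```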
# 📁 Training data folder structure
TRAIN_DATA_ROOT # (playground/data/train_dataset in default)
├── instructions
└── videos
├── anet_vidchatgpt_50frames
├── OOPs_50frames
├── OVIS_50frames
├── kinetics400_50frames
├── kinetics_50frames
├── how2qa_50frames
├── nextqa_50frames
└── webvid_50frames
// Example structure
{
'id': 'sampleid',
'src_data': 'original data source',
'conversations': [
{'role': 'human', 'value': ''},
{'role': 'gpt', 'value': ''}
],
'images': [
'video_dir/image_01.jpg',
'video_dir/image_02.jpg',
...
]
}
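A loader for files in the format above can be sketched as follows. This validator is our own, not part of the released codebase, and it assumes the `role`/`value` field names shown in the example:

```python
import json

REQUIRED_KEYS = {"id", "src_data", "conversations", "images"}

def load_instruction_samples(path: str) -> list:
    """Load an instruction JSON file and sanity-check each sample."""
    with open(path) as f:
        samples = json.load(f)
    for s in samples:
        assert REQUIRED_KEYS <= s.keys(), f"missing keys in sample {s.get('id')}"
        roles = [turn["role"] for turn in s["conversations"]]
        # Conversations alternate, with the human turn first.
        assert roles[::2] == ["human"] * len(roles[::2])
        assert roles[1::2] == ["gpt"] * len(roles[1::2])
    return samples
```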
# 📁 Evaluation folder structure
EVAL_DATA_ROOT # (playground/data/eval_dataset in default)
├── zeroshotqa
│ ├── annotations
│ └── frames
│ ├── anet
│ ├── msvd
│ └── msrvtt
└── videogenerativebench
├── annotations
└── frames
- Download our preprocessed zero-shot QA benchmark from 🤗 SNUMPR/vlm_rlaif_eval_datasets.
- For the original videos and test split, follow the instructions from Video-LLaVA to download the zero-shot QA dataset.
- Download the evaluation dataset & videos for zero-shot question answering from the Video-ChatGPT Qualitative Evaluation.
  - Videos, Descriptions
- Extract 50 frames per video.