The training data should be provided in a CSV file with the following format:
/absolute/path/to/image1.jpg, caption1, num_of_frames
/absolute/path/to/image2.jpg, caption2, num_of_frames
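For reference, here is a minimal Python sketch of writing and reading this layout (the rows and filenames below are placeholders; note the files carry no header row):

```python
import csv

# Placeholder rows in the expected layout: path, caption, num_of_frames.
rows = [
    ("/absolute/path/to/image1.jpg", "caption1", 1),
    ("/absolute/path/to/video1.mp4", "caption2", 128),
]

# Write the CSV; there is no header row.
with open("DATA.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back; strip() tolerates a space after each comma.
with open("DATA.csv", newline="") as f:
    for path, caption, num_frames in csv.reader(f):
        print(path, caption.strip(), int(num_frames))
```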
This dataset comprises 130M text-video pairs. You can download it and prepare it for training by following the instructions in the dataset repository; a README.md in the Google Drive link explains how to download and cut the videos. For this version, we use the dataset exactly as provided by the authors.
You can use ImageNet and UCF101 for a quick demo. After downloading the datasets, use the following commands to prepare the CSV files:
# ImageNet
python -m tools.datasets.convert_dataset imagenet IMAGENET_FOLDER --split train
# UCF101
python -m tools.datasets.convert_dataset ucf101 UCF101_FOLDER --split videos
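For intuition, a hypothetical sketch of what such a converter does: walk the dataset folder and emit one path, caption, num_of_frames row per file. This is not the actual tools.datasets.convert_dataset code; the caption and frame-count logic below are placeholders.

```python
import csv
import os

def convert_folder(root, out_csv, exts=(".avi", ".mp4", ".jpg", ".png")):
    """Walk root and emit one "path, caption, num_of_frames" row per file.
    Hypothetical: the real converter derives captions and splits properly;
    here the caption and frame count are crude placeholders."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                if name.lower().endswith(exts):
                    path = os.path.abspath(os.path.join(dirpath, name))
                    # Use the parent folder (class) name as a crude caption.
                    caption = os.path.basename(dirpath).replace("_", " ")
                    writer.writerow([path, caption, 0])  # frames unknown yet

convert_folder("UCF101_FOLDER", "ucf101.csv")
```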
We provide csvutil.py to manage the CSV files. You can use the following commands to process them:
# generate DATA_fmin_128_fmax_256.csv with frames between 128 and 256
python -m tools.datasets.csvutil DATA.csv --fmin 128 --fmax 256
# generate DATA_root.csv with absolute path
python -m tools.datasets.csvutil DATA.csv --root /absolute/path/to/dataset
# remove videos with no captions
python -m tools.datasets.csvutil DATA.csv --remove-empty-caption
# compute the number of frames for each video
python -m tools.datasets.csvutil DATA.csv --relength
# remove caption prefix
python -m tools.datasets.csvutil DATA.csv --remove-caption-prefix
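As a rough illustration of what options like --fmin/--fmax and --relength involve, here is a hypothetical re-implementation of two of them (not the actual csvutil code; it assumes the three-column layout above and uses OpenCV to probe frame counts):

```python
import csv

import cv2  # pip install opencv-python


def count_frames(path):
    """Probe a video's frame count, roughly what --relength computes."""
    cap = cv2.VideoCapture(path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n


def filter_by_frames(in_csv, out_csv, fmin, fmax):
    """Keep only rows whose frame count lies in [fmin, fmax],
    mirroring the --fmin/--fmax options."""
    with open(in_csv, newline="") as fin, open(out_csv, "w", newline="") as fout:
        writer = csv.writer(fout)
        for path, caption, num_frames in csv.reader(fin):
            if fmin <= int(num_frames) <= fmax:
                writer.writerow([path, caption.strip(), num_frames.strip()])


filter_by_frames("DATA.csv", "DATA_fmin_128_fmax_256.csv", 128, 256)
```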
To merge multiple CSV files, you can simply concatenate them, since they contain no header row:
cat *.csv > combined.csv
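Raw concatenation can pick up unwanted files or duplicate rows; if you prefer a safer merge, a small pandas sketch (assuming the header-less three-column layout above) is:

```python
import glob

import pandas as pd

# Read every CSV (header=None matches the header-less layout above);
# skipinitialspace tolerates the space after each comma.
frames = [
    pd.read_csv(p, header=None, names=["path", "caption", "num_frames"],
                skipinitialspace=True)
    for p in sorted(glob.glob("*.csv"))  # ensure this doesn't match the output file
]
combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="path")
combined.to_csv("combined.csv", header=False, index=False)
```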