Code for our CVPR 2023 paper on instilling a sense of time in video-language models.
[Project Page] | [Arxiv] | [Data] | [Code] | [CVPR 2023 Video] | [CVPR 2023 Poster]
- Brief Overview
- Updates
- Installation & Setup
- Datasets
- Models
- Post-pretraining: TACT
- Evaluation: TACT
- Evaluation: Downstream Tasks
- Citation
- Acknowledgements
- We show that existing video-language models struggle to associate time order in video and language through a controlled experiment on synthetic data.
- Based on VideoCLIP, we propose TACT (Temporal Adaptation by Consistent Time-ordering), a method for temporal adaptation using this time order consistency without having to pretrain from scratch.
- We demonstrate improved zero-shot generalizability of our temporally adapted models on tasks that require higher time awareness.
- 24th March 2023: Code released.
- 11th June 2024: On our synthetic benchmark, Video-LLAMA achieves an impressive 88.33% accuracy and TimeChat achieves 76.67%. We will continue to add evaluations of more recent LLMs on our synthetic benchmark. We have added a benchmark on https://paperswithcode.com/.
Create a `conda` environment and install packages as described in `setup/env.md`. We recommend running `python setup/check_packages.py` to check if all packages are installed correctly.
We use a combination of synthetic and real datasets to evaluate our approach. Below, you can find instructions to download and prepare the datasets. Here, we present instructions for our Synthetic dataset and the TEMPO-TL dataset.
For each dataset, we provide a `.zip` file that contains (i) train-test splits and (ii) S3D features for video (at 1 FPS) that serve as input to the VideoCLIP model. Use the following to download all datasets:
bash setup/download_datasets.sh /path/to/datasets/
Pass the path to the folder where you want to store the datasets (e.g., `./all_data/`).
We create simple synthetic video-language pairs by stitching together a pair of events (e.g., "a red circle appears" and "a yellow circle appears") with text description connected by before/after relations. An example is shown here:
As a real dataset, we consider the TEMPO-TL dataset that similarly stitches together a pair of events in text for clips in the same video.
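The stitching used for the synthetic pairs can be sketched in a few lines. This is a minimal illustration, not the generator used to build the released dataset: the color list, the caption template, and the helper names (`event`, `stitch`, `make_pair`) are all assumptions made for this example. It produces, for each ordered pair of events, a caption that matches the event order and a time-reversed caption that contradicts it.

```python
from itertools import permutations

# Hypothetical event vocabulary (an assumption for this sketch).
COLORS = ["red", "yellow", "green", "blue"]


def event(color):
    """Text for a single event, following the example phrasing in this README."""
    return f"a {color} circle appears"


def stitch(first, second, relation):
    """Caption for two events that happen in the order (first, second)."""
    if relation == "before":
        return f"{first} before {second}"
    if relation == "after":
        return f"{second} after {first}"
    raise ValueError(f"unknown relation: {relation}")


def make_pair(color_a, color_b, relation):
    """Build a time-consistent caption and its time-reversed counterpart."""
    first, second = event(color_a), event(color_b)
    return {
        "events": [first, second],                           # order in the video
        "caption": stitch(first, second, relation),          # agrees with the video
        "time_reversed": stitch(second, first, relation),    # contradicts the video
    }


samples = [
    make_pair(a, b, relation)
    for a, b in permutations(COLORS, 2)
    for relation in ("before", "after")
]
print(len(samples))
print(samples[0]["caption"])
print(samples[0]["time_reversed"])
```

A time-aware model should score the video higher against `caption` than against `time_reversed`, since the two texts differ only in the order of events.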
New datasets: In order to evaluate our approach on other (new) datasets, you need to first generate and save S3D video features. See this for an open-source feature extractor. Then, create splits and a dataset object in `package/datasets/`. Please see `package/datasets/tempo.py` for reference.
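For the "create splits" step, a reproducible train/test split over video ids could look like the sketch below. The file layout (one JSON list of ids per split), the 80/20 ratio, and the function names are assumptions for illustration only; the format actually expected by the dataset classes is defined in `package/datasets/tempo.py` and should take precedence.

```python
import json
import random
import tempfile
from pathlib import Path


def make_splits(video_ids, test_fraction=0.2, seed=0):
    """Shuffle video ids with a fixed seed and cut them into train/test lists."""
    ids = sorted(set(video_ids))        # de-duplicate and fix the starting order
    random.Random(seed).shuffle(ids)    # reproducible shuffle
    num_test = max(1, round(len(ids) * test_fraction))
    return {"train": sorted(ids[num_test:]), "test": sorted(ids[:num_test])}


def save_splits(splits, out_dir):
    """Write one JSON list of ids per split (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, ids in splits.items():
        (out_dir / f"{name}.json").write_text(json.dumps(ids, indent=2))


video_ids = [f"video_{i:03d}" for i in range(10)]  # placeholder ids
splits = make_splits(video_ids)
with tempfile.TemporaryDirectory() as tmp:
    save_splits(splits, tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
print(written, len(splits["train"]), len(splits["test"]))
```

Splitting by video id (rather than by clip or caption) keeps all clips from one video in the same split.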
We base our experiments on the VideoCLIP model from FAIR. Instructions in `setup/env.md` include download of relevant checkpoints for VideoCLIP.
Checkpoint zoo: Here, we provide checkpoints for TACT adapted VideoCLIP models post-pretrained on (i) TEMPO-TL, (ii) ActivityNet, (iii) Charades, (iv) Charades-Ego.
Post-pretraining Dataset | Hyperparameters | Download link
---|---|---
TEMPO-TL | 1.0, 1.0, 1.0 | Link
ActivityNet | 1.0, 1.0, 0.0 | Link
Charades | 1.0, 1.0, 0.0 | Link
Charades-Ego | 1.0, 1.0, 1.0 | Link
To download all checkpoints in one go, run:
bash setup/download_checkpoints.sh /path/to/checkpoints/
Pass the path to the folder where you want to store the checkpoints (e.g., `./all_checkpoints/`).
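The example checkpoint names used in this README (e.g., `tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt`) appear to encode the post-pretraining dataset, three hyperparameter values, the epoch, and the step. The helper below is a small sketch for reading those fields back out; the regular expression is inferred from those example names only, `parse_checkpoint_name` is a hypothetical helper that is not part of this repository, and which hyperparameter each of the three values refers to is not assumed here.

```python
import re
from pathlib import Path

# Pattern inferred from the example file names in this README (an assumption).
CKPT_PATTERN = re.compile(
    r"^(?P<dataset>[^-]+)-hparams_(?P<hparams>[0-9.]+(?:_[0-9.]+)*)"
    r"-epoch=(?P<epoch>\d+)-step=(?P<step>\d+)\.ckpt$"
)


def parse_checkpoint_name(path):
    """Split a checkpoint file name into dataset, hyperparameters, epoch and step."""
    # Tolerate shell-escaped '=' (written as '\=') before taking the file name.
    name = Path(str(path).replace("\\=", "=")).name
    match = CKPT_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name}")
    return {
        "dataset": match["dataset"],
        "hparams": [float(v) for v in match["hparams"].split("_")],
        "epoch": int(match["epoch"]),
        "step": int(match["step"]),
    }


info = parse_checkpoint_name(
    "./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt"
)
print(info)
```

This can be handy for sanity-checking that the checkpoint passed via `-c` was post-pretrained on the dataset you intended.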
- Post-pretraining on TEMPO-TL dataset
python postpretrain.py --dataset tempo --eval_subset temporal_1k --no_wandb --data_root /ssd/pbagad/datasets/ --only_train
Replace `--data_root` with the path to where all your datasets are stored. Make sure to change the `entity` and `project` arguments in `postpretrain.py` to log to your own wandb account.
- Pre-trained VideoCLIP
python postpretrain.py --dataset tempo --eval_subset temporal_1k --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 52% accuracy.
- TACT post-pretrained VideoCLIP
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/  # For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python postpretrain.py --dataset tempo --eval_subset temporal_1k --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 66% accuracy.
The detailed results on more datasets are provided in the paper and also shown below.
- TACT post-pretrained (on TEMPO)
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/  # For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python postpretrain.py --dataset synthetic --eval_subset v2.0 --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt --gpus 0
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 65% accuracy. Note that since this is a tiny evaluation set, using multiple GPUs will lead to incorrect accuracies because of how results are aggregated across GPUs.
- TACT post-pretrained (on Charades-Ego)
ckpt=/path/to/tact/checkpoint/trained/on/Charades-Ego/  # For example, ckpt=./all_checkpoints/charadesego-hparams_1.0_1.0_1.0-epoch\=2-step\=3639.ckpt
python postpretrain.py --dataset synthetic --eval_subset v2.0 --eval_split test --only_eval --no_wandb --data_root /ssd/pbagad/datasets/ -c $ckpt --gpus 0
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 85% accuracy.
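The multi-GPU caveat above can be illustrated with a toy calculation. The sketch shows two common ways in which distributed evaluation can skew accuracy on a small set: averaging per-GPU accuracies over unequal shards, and padding the set with repeated samples so that every GPU receives the same count. Whether either mechanism is the one at play inside `postpretrain.py` is an assumption, and the 10-sample correctness vector is made up for illustration.

```python
def accuracy(correct_flags):
    """Fraction of correct predictions."""
    return sum(correct_flags) / len(correct_flags)


def shard(items, num_gpus):
    """Round-robin split of samples across GPUs."""
    return [items[rank::num_gpus] for rank in range(num_gpus)]


def pad_to_even(items, num_gpus):
    """Repeat leading samples so that the count is divisible by num_gpus."""
    remainder = len(items) % num_gpus
    if remainder == 0:
        return list(items)
    return list(items) + list(items[: num_gpus - remainder])


# Hypothetical per-sample correctness on a tiny evaluation set (7 of 10 correct).
correct = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
num_gpus = 4

single_gpu = accuracy(correct)

# Mechanism 1: average of per-GPU accuracies over shards of unequal size.
per_gpu = [accuracy(s) for s in shard(correct, num_gpus)]
mean_of_shards = sum(per_gpu) / len(per_gpu)

# Mechanism 2: duplicated samples from padding are counted twice.
padded = accuracy(pad_to_even(correct, num_gpus))

print(f"single GPU:      {single_gpu:.4f}")
print(f"mean of shards:  {mean_of_shards:.4f}")
print(f"padded, pooled:  {padded:.4f}")
```

On a large evaluation set these discrepancies shrink toward zero, which is why the issue mainly matters for tiny sets such as this one.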
To illustrate zero-shot performance of our TACT adapted model on a downstream task, we provide code to run the following evaluations.
Here, we evaluate `VideoQA` on a subset of the `AGQA` dataset.
An example instance from the `AGQA` dataset is shown below:
Note that, to run this, you need the pre-computed S3D features for the AGQA dataset.
- Pre-trained VideoCLIP
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset agqa --task videoqa --no_save
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 49.9% accuracy.
- TACT post-pretrained VideoCLIP
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/  # For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset agqa --task videoqa --no_save -c $ckpt
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 57.1% accuracy.
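For intuition about what a zero-shot VideoQA accuracy measures with a contrastive video-text model, the sketch below picks, for each video, the candidate answer text whose embedding is most similar to the video embedding. This is a generic illustration under stated assumptions, not the logic of `downstream_zeroshot.py`: the embeddings are toy 2-D vectors, and the function names and the example data structure are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pick_answer(video_emb, candidate_embs):
    """Index of the candidate text most similar to the video."""
    scores = [cosine(video_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_accuracy(examples):
    """Fraction of examples where the top-scoring candidate is the labeled one."""
    hits = sum(pick_answer(ex["video"], ex["candidates"]) == ex["label"] for ex in examples)
    return hits / len(examples)


# Toy embeddings standing in for model outputs (made up for this sketch).
examples = [
    {"video": [1.0, 0.0], "candidates": [[0.9, 0.1], [0.1, 0.9]], "label": 0},
    {"video": [0.0, 1.0], "candidates": [[1.0, 0.0], [0.2, 0.8]], "label": 1},
    {"video": [1.0, 1.0], "candidates": [[1.0, 0.9], [0.0, 1.0]], "label": 1},
]
acc = zero_shot_accuracy(examples)
print(f"accuracy: {acc:.3f}")
```

No parameters are updated in this procedure, which is what makes the evaluation zero-shot.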
Here, we evaluate `Action Retrieval` on a subset of the `SSv2` dataset.
An example instance from the `SSv2` dataset is shown below:
Note that, to run this, you need the pre-computed S3D features for the SSv2 dataset.
- Pre-trained VideoCLIP
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset ssv2 --task action_retrieval --no_save --split "validation-tmpl-ret-singularity"
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 3.4% mAP (`metric_t2v_mAP`).
- TACT post-pretrained VideoCLIP
ckpt=/path/to/tact/checkpoint/trained/on/TEMPO/  # For example, ckpt=./all_checkpoints/tempo-hparams_1.0_1.0_1.0-epoch=27-step=8288.ckpt
python downstream_zeroshot.py --data_root /ssd/pbagad/datasets/ --dataset ssv2 --task action_retrieval --no_save --split "validation-tmpl-ret-singularity" -c $ckpt
Replace `--data_root` with the path to where all your datasets are stored. This should yield about 4.2% mAP (`metric_t2v_mAP`).
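As a reminder of what a text-to-video mAP measures, the sketch below computes mean average precision from a text-by-video similarity matrix using the standard definition. It is not taken from this repository, so the exact computation behind `metric_t2v_mAP` may differ in details (e.g., tie handling); the similarity scores and relevance sets are toy values.

```python
def average_precision(ranked_relevance):
    """AP for one query, given relevance flags in ranked order (best first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / hits if hits else 0.0


def text_to_video_map(similarity, relevant):
    """Mean AP over text queries.

    similarity[t][v] is the score of text query t against video v;
    relevant[t] is the set of video indices that are correct for query t.
    """
    aps = []
    for t, scores in enumerate(similarity):
        order = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
        aps.append(average_precision([v in relevant[t] for v in order]))
    return sum(aps) / len(aps)


# Toy example: 2 text queries, 4 videos (values made up for this sketch).
similarity = [
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
]
relevant = [{0, 2}, {3}]
map_score = text_to_video_map(similarity, relevant)
print(f"mAP: {map_score:.2f}")
```

In the toy example the first query ranks both relevant videos on top (AP of 1.0) while the second ranks its only relevant video second (AP of 0.5), giving an mAP of 0.75.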
The detailed results on more datasets/tasks are provided in the paper and also shown below.
If you found our work useful or relevant, please consider citing our paper:
@inproceedings{
bagad2023testoftime,
title={{T}est of {T}ime: {I}nstilling {V}ideo-{L}anguage {M}odels with a {S}ense of {T}ime},
author={Bagad, Piyush and Tapaswi, Makarand and Snoek, Cees G. M.},
booktitle={CVPR},
year={2023}
}
- We acknowledge support from the ELLIS Amsterdam Unit and the AMS Scholarship to Piyush as a Master's student.
- We also thank Dr. Dennis Koelma for regular help with compute infrastructure and hosting of data and models, and Dr. Hazel Doughty for useful discussions.
- We also acknowledge all relevant prior work, particularly VideoCLIP and TEMPO, for making their code and data publicly available.
⚠️ Infra note: Our code has been run on a single node with 4 GPUs (either NVIDIA RTX A5000 or NVIDIA GeForce 1080). Running it on different infrastructures may cause differences in results. However, the trends and inferences should be similar (e.g., post-pretraining helps with temporal ordering task, etc.).
💡: If you have any issues or suggestions, feel free to open an issue or contact us via email.
Please also consider looking at the following related papers:
- Wu et al, Audio-Text Models Do Not Yet Leverage Natural Language. Like us, they too check if models capture event ordering, albeit for audio-text models.
- Yuksekgonul et al, When and why vision-language models behave like bags-of-words, and what to do about it?, ICLR 2023. They test image-language models for understanding of object properties, relational understanding and order sensitivity.
- Hazra et al, EgoTV: Egocentric Task Verification from Natural Language Task Descriptions, arXiv 2023. They propose a synthetic benchmark of procedural tasks where there is an order between the subtasks, e.g., `apple is heated, then, it is cleaned`.
- Xu et al, Don’t Pour Cereal into Coffee: Differentiable Temporal Logic for Temporal Action Segmentation, NeurIPS 2022. They propose use of temporal logic to apply declarative temporal constraints to the output of deep networks.
- Xie et al, Enhance Temporal Relations in Audio Captioning with Sound Event Detection, arXiv 2023. This paper aims to better capture temporal relationships in caption generation with sound event detection (SED), a task that locates events’ timestamps.