This repository contains a PyTorch implementation of SIFAR, an approach that repurposes image classifiers for efficient action recognition by rearranging input video frames into super images.
For details, please see the paper Can an Image Classifier Suffice for Action Recognition? by Quanfu Fan*, Richard Chen* and Rameswar Panda*.
If you use this code for a paper, please cite:
@INPROCEEDINGS{fan-iclr2022,
title={Can an Image Classifier Suffice for Action Recognition?},
author={Quanfu Fan and Richard Chen and Rameswar Panda},
booktitle={International Conference on Learning Representations (ICLR)},
year={2022}
}
First, clone the repository locally:
git clone https://github.com/IBM/sifar-pytorch
cd sifar-pytorch
pip install -r requirements.txt
To load video input, you need to install the PyAV package.
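As a rough illustration of what PyAV provides, the sketch below decodes frames from a video file into RGB arrays. It is not the repository's data loader: the function name `read_rgb_frames` and its `max_frames` parameter are hypothetical, and it assumes PyAV was installed from PyPI (where the package is named `av`).

```python
def read_rgb_frames(video_path, max_frames=None):
    """Yield decoded frames from the first video stream as HxWx3 uint8 arrays."""
    # Imported lazily so that this module can be loaded without PyAV installed.
    import av

    with av.open(video_path) as container:
        for index, frame in enumerate(container.decode(video=0)):
            if max_frames is not None and index >= max_frames:
                break
            # Convert the decoded frame to a NumPy array in RGB channel order.
            yield frame.to_ndarray(format="rgb24")
```

Iterating over `read_rgb_frames(path, max_frames=8)` for a local video file would yield at most eight arrays, one per decoded frame.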
Please refer to https://github.com/IBM/action-recognition-pytorch for how to prepare action recognition benchmark datasets such as Kinetics400 and Something-Something. For Kinetics400, we used the URLs provided at this link to download the data.
Model | Frames | Super Image | Image Size | Model Size | FLOPs (G) |
---|---|---|---|---|---|
SIFAR-B-7 (sifar_base_patch4_window7_224) | 8 | 3x3 | 224 | 87 | 138 |
SIFAR-B-12 (sifar_base_patch4_window12_192_3x3) | 8 | 3x3 | 192 | 87 | 106 |
SIFAR-B-14 (sifar_base_patch4_window14_224_3x3) | 8 | 3x3 | 224 | 87 | 147 |
SIFAR-B-12† (sifar_base_patch4_window12_192_4x4) | 16 | 4x4 | 192 | 87 | 189 |
SIFAR-B-14† (sifar_base_patch4_window12_224_4x4) | 16 | 4x4 | 224 | 87 | 263 |
SIFAR-B-12‡ (sifar_base_patch4_window12_192_3x3) | 8 | 3x3 | 384 | 87 | 423 |
The table above lists the configurations of the models supported by SIFAR. When training or testing a model, please make sure that the input arguments match a configuration in the table.
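The relationship between the Frames, Super Image, and Image Size columns can be sketched as follows. This is an illustration only, not the repository's code: the helper `super_image_layout` is hypothetical, and it assumes that Image Size is the per-frame resolution, that frames fill a square grid in row-major order, and that unused grid cells are left blank.

```python
def super_image_layout(num_frames, grid_rows, frame_size):
    """Return the super-image side length and the top-left pixel offset of each frame.

    Assumes a square grid filled in row-major order; unused cells stay blank.
    """
    capacity = grid_rows * grid_rows
    if num_frames > capacity:
        raise ValueError(
            f"{num_frames} frames do not fit in a {grid_rows}x{grid_rows} grid"
        )
    offsets = []
    for index in range(num_frames):
        row, col = divmod(index, grid_rows)
        offsets.append((row * frame_size, col * frame_size))
    return grid_rows * frame_size, offsets


# 8 frames of 224x224 pixels in a 3x3 grid: one cell remains empty.
side, offsets = super_image_layout(num_frames=8, grid_rows=3, frame_size=224)
print(side)          # 672
print(offsets[3])    # (224, 0)
print(len(offsets))  # 8
```

Under these assumptions, an 8-frame, 3x3, 224-pixel configuration yields a 672x672 super image, and a 16-frame, 4x4, 192-pixel configuration yields a 768x768 one.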
Here is an example of training an 8-frame Kinetics400 model with uniform sampling on a single node with 6 GPUs:
python -m torch.distributed.launch --nproc_per_node=6 main.py --data_dir [path-to-video] --use_pyav --dataset kinetics400 \
--opt adamw --lr 1e-4 --epochs 30 --sched cosine --duration 8 --batch-size 2 --super_img_rows 3 --disable_scaleup \
--mixup 0.8 --cutmix 1.0 --drop-path 0.1 --pretrained --warmup-epochs 5 --no-amp --model sifar_base_patch4_window14_224_3x3 \
--output_dir [output_dir]
To enable position embedding, add '--hpe_to_token' to the script.
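For readers unfamiliar with uniform sampling, the sketch below shows one common variant: the video is split into equal segments and the center frame of each segment is selected. This README does not spell out the repository's exact sampling rule (training code often picks a random frame within each segment instead), so treat the hypothetical `uniform_sample_indices` helper as an assumption.

```python
def uniform_sample_indices(total_frames, num_samples):
    """Pick the center frame index of each of num_samples equal-length segments."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


# An 80-frame video sampled down to 8 frames (matching --duration 8).
print(uniform_sample_indices(total_frames=80, num_samples=8))
# [5, 15, 25, 35, 45, 55, 65, 75]
```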
Below is another example of fine-tuning an SSV2 model from a Kinetics400-pretrained checkpoint:
python -m torch.distributed.launch --nproc_per_node=6 main.py --data_dir [path-to-video] --use_pyav --dataset sth2stv2 \
--opt adamw --lr 1e-4 --epochs 20 --sched cosine --duration 8 --batch-size 2 --super_img_rows 3 --disable_scaleup \
--mixup 0.8 --cutmix 1.0 --drop-path 0.1 --pretrained --warmup-epochs 0 --no-amp --model sifar_base_patch4_window14_224_3x3 \
--logdir [output_dir] --hpe_to_token --initial_checkpoint [path-to-pretrain]
More options for training SIFAR models can be found in main.py. You can get help via
python3 main.py --help
To evaluate a model, add '--eval' to a training command and specify the path to the model to be tested with '--initial_checkpoint'. The number of clips and crops used for evaluation can be set via '--num_clips' and '--num_crops'. Below is an example of running a model with 3 crops and 3 clips:
python -m torch.distributed.launch --nproc_per_node=6 main.py --data_dir [path-to-video] --use_pyav --dataset sth2stv2 \
--opt adamw --lr 1e-4 --epochs 30 --sched cosine --duration 8 --batch-size 2 --super_img_rows 3 --disable_scaleup \
--mixup 0.8 --cutmix 1.0 --drop-path 0.1 --pretrained --warmup-epochs 5 --no-amp --model sifar_base_patch4_window14_224_3x3 \
--output_dir [output_dir] --hpe_to_token --initial_checkpoint [path-to-pretrain] --eval --num_crops 3 --num_clips 3
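With 3 clips and 3 crops, each video is scored 9 times. A common way to turn these views into a single prediction is to average the per-view class probabilities, as sketched below. Whether the repository averages probabilities, logits, or something else is not stated here, so the hypothetical `aggregate_views` helper is an assumption for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def aggregate_views(view_logits):
    """Average per-view class probabilities; return (predicted_class, mean_probs)."""
    probs = [softmax(logits) for logits in view_logits]
    num_classes = len(probs[0])
    mean_probs = [
        sum(p[c] for p in probs) / len(probs) for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=mean_probs.__getitem__)
    return predicted, mean_probs


# 3 clips x 3 crops = 9 views of one video; toy logits over 3 classes.
views = [[2.0, 1.0, 0.0]] * 5 + [[0.0, 3.0, 0.0]] * 4
predicted, mean_probs = aggregate_views(views)
print(predicted)  # 1
```

In this toy input, five of the nine views favor class 0, yet the averaged probabilities favor class 1 because the remaining four views are far more confident. That is the difference between averaging scores and majority voting.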
Dataset | Model | Frames | Top1 | Top5 | Download |
---|---|---|---|---|---|
Kinetics400 | SIFAR-B-12 | 8 | 80.0 | 94.5 | - |
Kinetics400 | SIFAR-B-12† | 16 | 80.4 | 94.4 | - |
Kinetics400 | SIFAR-B-14 | 8 | 80.2 | 94.4 | link |
Kinetics400 | SIFAR-B-14† | 16 | 81.8 | 95.2 | link |
SSV2 | SIFAR-B-12 | 8 | 60.8 | 87.3 | - |
SSV2 | SIFAR-B-12† | 16 | 61.4 | 87.6 | - |
SSV2 | SIFAR-B-14 | 8 | 61.6 | 87.9 | link |
SSV2 | SIFAR-B-14† | 16 | 62.6 | 88.5 | link |
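The Top1 and Top5 columns are top-k accuracies in percent. As a reminder of how such a metric is usually computed, here is a minimal sketch on toy data; the `topk_accuracy` helper is hypothetical and is not taken from the repository's evaluation code.

```python
def topk_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        ranked = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


scores = [
    [0.1, 0.7, 0.2],  # true class 1: top-1 hit
    [0.5, 0.3, 0.2],  # true class 1: hit only within the top 2
    [0.6, 0.3, 0.1],  # true class 2: miss for k <= 2
    [0.2, 0.2, 0.6],  # true class 2: top-1 hit
]
labels = [1, 1, 2, 2]
print(topk_accuracy(scores, labels, k=1))  # 50.0
print(topk_accuracy(scores, labels, k=2))  # 75.0
```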
This repository is released under the Apache-2.0 license as found in the LICENSE file.