
简体中文 | English

TSN

Contents

- Introduction
- Data
- Train
- Test
- Inference
- Details
- Reference

Introduction

Temporal Segment Network (TSN) is a classic 2D-CNN-based method for video classification. It targets long-range action recognition in video: instead of dense sampling, it sparsely samples frames across the whole video, which captures global video information while removing redundancy and reducing the amount of computation. Its core idea is to average the per-segment predictions into an overall video-level representation that is then fed to the classifier, as sketched below. The model implemented by this code is a TSN network operating on the RGB stream only, with ResNet-50 as the backbone.
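
The segment-consensus step can be summarized in a few lines. Below is a minimal NumPy sketch of the idea only (averaging per-segment class scores into a single video-level prediction); it is illustrative and not PaddleVideo code, and all shapes and scores are placeholders.

```python
# Minimal sketch of TSN's segment consensus (illustrative, not PaddleVideo code):
# per-segment class scores are averaged into one video-level prediction.
import numpy as np

num_seg, num_classes = 3, 400              # 400 classes for Kinetics-400

# Placeholder scores from the 2D-CNN head, one row per sampled segment
segment_scores = np.random.rand(num_seg, num_classes)

video_score = segment_scores.mean(axis=0)  # average consensus over segments
pred_class = int(video_score.argmax())     # final video-level class
print(pred_class)
```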


For details, please refer to the ECCV 2016 paper Temporal Segment Networks: Towards Good Practices for Deep Action Recognition.

Data

PaddleVideo provides training and testing scripts on the Kinetics-400 dataset. For Kinetics-400 data download and preparation, please refer to Kinetics-400 data preparation.

Train

Training on the Kinetics-400 dataset

Download and add pre-trained models

  1. Load the ResNet-50 weights pre-trained on ImageNet-1000 as the backbone initialization parameters (ResNet50_pretrain.pdparams), or download them via the command line (an optional check that the weights load is sketched after this list):

    wget https://videotag.bj.bcebos.com/PaddleVideo/PretrainModel/ResNet50_pretrain.pdparams
  2. Open PaddleVideo/configs/recognition/tsn/tsn_k400_frames.yaml and fill in the path of the downloaded weights in the pretrained: field:

    MODEL:
        framework: "Recognizer2D"
        backbone:
            name: "ResNet"
            pretrained: fill in the path of the downloaded weights here
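
As an optional sanity check (an extra step, not part of the official workflow), the downloaded weight file can be loaded directly with paddle.load to confirm it parses; the parameter names printed depend on the checkpoint.

```python
# Optional sanity check that the downloaded backbone weights load correctly.
# Assumes PaddlePaddle is installed; the key names depend on the checkpoint.
import paddle

state_dict = paddle.load("ResNet50_pretrain.pdparams")
print(len(state_dict), "parameter tensors")
print(list(state_dict.keys())[:5])  # peek at a few parameter names
```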

Start training

  • The Kinetics-400 dataset is trained with 8 GPUs. For data in frames format, the training start command is as follows:

    python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir=log_tsn main.py --validate -c configs/recognition/tsn/tsn_k400_frames.yaml

Test

Since the TSN model's test mode uses TenCrop sampling, which is slower but more accurate than the CenterCrop used for validation during training, the validation metric topk Acc recorded in the training log does not represent the final test score. After training completes, you can therefore run the best model in test mode to obtain the final metrics (a sketch of TenCrop follows the command):

python3.7 -B -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir=log_tsn main.py --test -c configs/recognition/tsn/tsn_k400_frames.yaml -w "output/TSN/TSN_best.pdparams"
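
For intuition, here is a minimal NumPy sketch of the TenCrop idea: five crops (four corners plus the center) and their horizontal flips give ten views per frame. This is an illustration only; PaddleVideo implements TenCrop in its own pipeline operators.

```python
# Minimal sketch of TenCrop: 4 corner crops + 1 center crop, plus their
# horizontal flips = 10 views per frame. Illustrative only.
import numpy as np

def ten_crop(img, size):
    h, w, _ = img.shape
    offsets = [(0, 0), (0, w - size), (h - size, 0),
               (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips of the 5 crops
    return np.stack(crops)                # shape: (10, size, size, 3)

frame = np.zeros((256, 340, 3), dtype=np.float32)
print(ten_crop(frame, 224).shape)         # (10, 224, 224, 3)
```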

With the following test configuration, the metrics on the Kinetics-400 validation set are as follows:

| backbone | Sampling method | Training Strategy | num_seg | target_size | Top-1 (%) | checkpoints |
| :------: | :-------------: | :---------------: | :-----: | :---------: | :-------: | :---------: |
| ResNet50 | TenCrop | NCHW | 3 | 224 | 69.81 | TSN_k400.pdparams |
| ResNet50 | TenCrop | NCHW | 8 | 224 | 71.70 | TSN_k400_8.pdparams |

Inference

Export inference model

python3.7 tools/export_model.py -c configs/recognition/tsn/tsn_k400_frames.yaml \
                                -p data/TSN_k400.pdparams \
                                -o inference/TSN

The above command will generate the model structure file TSN.pdmodel and the model weights file TSN.pdiparams required for prediction.

For the meaning of each parameter, please refer to [Model Inference Method](https://github.com/PaddlePaddle/PaddleVideo/blob/release/2.0/docs/zh-CN/start.md#2-Model Reasoning).
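
Should you want to confirm the exported files work with Paddle's inference API, a minimal loading sketch follows (paths assume the export command above; tools/predict.py below remains the standard prediction entry point).

```python
# Quick check that the exported model loads with Paddle's inference API.
# Paths assume the export command above was used.
from paddle.inference import Config, create_predictor

config = Config("inference/TSN/TSN.pdmodel", "inference/TSN/TSN.pdiparams")
predictor = create_predictor(config)
print(predictor.get_input_names())  # inspect the expected input tensors
```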

Infer

python3.7 tools/predict.py --input_file data/example.avi \
                           --config configs/recognition/tsn/tsn_k400_frames.yaml \
                           --model_file inference/TSN/TSN.pdmodel \
                           --params_file inference/TSN/TSN.pdiparams \
                           --use_gpu=True \
                           --use_tensorrt=False

Details

Data processing:

  • The model reads mp4 data from the Kinetics-400 dataset, first divides each video into num_seg segments, and then uniformly samples 1 frame from each segment to obtain num_seg sparsely sampled frames. The same random data augmentation is then applied to all num_seg frames, including multi-scale random cropping, random horizontal flipping, and data normalization, and finally the frames are scaled to target_size. The sampling step is sketched below.
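
A minimal sketch of this sampling scheme (illustrative only; the real pipeline lives in PaddleVideo's dataset operators):

```python
# Sketch of TSN's sparse sampling: split the frame indices into num_seg equal
# segments and draw one frame per segment (random offset when training,
# segment center otherwise). Illustrative only, not the PaddleVideo pipeline.
import random

def sparse_sample(num_frames, num_seg, training=True):
    seg_len = max(num_frames // num_seg, 1)
    indices = []
    for i in range(num_seg):
        offset = random.randrange(seg_len) if training else seg_len // 2
        indices.append(i * seg_len + offset)
    return indices

print(sparse_sample(300, num_seg=3))          # e.g. [37, 171, 240]
print(sparse_sample(300, 8, training=False))  # deterministic segment centers
```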

Training strategy:

  • The Momentum optimizer is used for training, with momentum=0.9 (these settings are sketched in Paddle APIs after this list)

  • L2_Decay is used, with a weight decay coefficient of 1e-4

  • Global gradient clipping is used, with a clipping norm of 40.0

  • The total number of epochs is 100; the learning rate is decayed by a factor of 0.1 at epochs 40 and 80

  • Dropout_ratio=0.4
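
Written with standard Paddle APIs, these settings look roughly as follows. The base learning rate (0.01) and the stand-in model below are placeholder assumptions; the authoritative values live in tsn_k400_frames.yaml.

```python
# Hedged sketch of the training strategy above using standard Paddle APIs.
# The base learning rate and the stand-in model are placeholders.
import paddle

# 0.1x decay at epochs 40 and 80 over 100 epochs (epoch-wise stepping assumed)
lr = paddle.optimizer.lr.PiecewiseDecay(
    boundaries=[40, 80], values=[0.01, 0.001, 0.0001])

model = paddle.nn.Linear(2048, 400)  # stand-in for the real recognizer

optimizer = paddle.optimizer.Momentum(
    learning_rate=lr,
    momentum=0.9,                                              # momentum=0.9
    parameters=model.parameters(),
    weight_decay=paddle.regularizer.L2Decay(1e-4),             # L2_Decay 1e-4
    grad_clip=paddle.nn.ClipGradByGlobalNorm(clip_norm=40.0))  # global clip 40.0
```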

Parameter initialization:

  • The convolutional layers of the TSN model use Paddle's default KaimingNormal (weights) and Constant (bias) initialization; the FC layer's weights are initialized from a Normal(mean=0, std=0.01) distribution and its bias with the constant 0, as sketched below.
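
Expressed with Paddle's built-in initializers, the scheme looks roughly like the sketch below; the layer shapes are placeholders, since the real layers are built inside the ResNet backbone and the recognizer head.

```python
# Sketch of the initialization scheme above using Paddle's built-in
# initializers; layer shapes here are placeholders.
import paddle
from paddle import nn

# Conv layers: KaimingNormal weights, constant-0 bias (Paddle defaults)
conv = nn.Conv2D(
    3, 64, kernel_size=7,
    weight_attr=paddle.ParamAttr(initializer=nn.initializer.KaimingNormal()),
    bias_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)))

# FC layer: Normal(mean=0, std=0.01) weights, constant-0 bias
fc = nn.Linear(
    2048, 400,
    weight_attr=paddle.ParamAttr(
        initializer=nn.initializer.Normal(mean=0.0, std=0.01)),
    bias_attr=paddle.ParamAttr(initializer=nn.initializer.Constant(0.0)))
```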

Reference

- Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool