Criss-Cross Attention (2D & 3D) for Semantic Segmentation in pure PyTorch, with a faster and more precise implementation.
**2021/03**: Three kinds of pure-PyTorch implementations of the 3D CCNet module are released in CC3d.py. You can check their correctness with check3dby2d.py and check3d.py.
I unofficially re-implement CCNet: Criss-Cross Attention for Semantic Segmentation in pure PyTorch for better compatibility across versions and environments. Many previous open-source projects rely on a CUDA extension for PyTorch, which suffers from compatibility problems and precision loss. Moreover, a CUDA extension may not be optimized and accelerated by PyTorch when cudnn.benchmark = True is set. To address these issues, I design a Criss-Cross Attention operation in CC.py based on PyTorch tensor transformations; it is implemented in parallel and is both faster and more precise in the forward result and the backward gradient. A sketch of the idea is shown below.
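As a rough sketch of the approach (the class and helper names here are mine, not necessarily those in CC.py): fold the width and the height dimensions into the batch dimension so that every column and every row becomes an independent batched attention problem, mask the duplicated center position with -inf, and take one softmax over the concatenated criss-cross scores.

```python
import torch
import torch.nn as nn


def neg_inf_diag(B, H, W, device):
    # -inf on the diagonal masks the center pixel in the column branch,
    # so it is not counted twice (it also appears in the row branch).
    return -torch.diag(torch.full((H,), float("inf"), device=device)).unsqueeze(0).repeat(B * W, 1, 1)


class CrissCrossAttention(nn.Module):
    """Criss-cross attention built from pure tensor transformations."""

    def __init__(self, in_dim):
        super().__init__()
        self.query_conv = nn.Conv2d(in_dim, in_dim // 8, 1)
        self.key_conv = nn.Conv2d(in_dim, in_dim // 8, 1)
        self.value_conv = nn.Conv2d(in_dim, in_dim, 1)
        self.softmax = nn.Softmax(dim=3)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, _, h, w = x.size()
        q, k, v = self.query_conv(x), self.key_conv(x), self.value_conv(x)
        # Fold width (resp. height) into the batch dim so every column
        # (resp. row) becomes an independent attention problem.
        q_h = q.permute(0, 3, 1, 2).reshape(b * w, -1, h).permute(0, 2, 1)
        q_w = q.permute(0, 2, 1, 3).reshape(b * h, -1, w).permute(0, 2, 1)
        k_h = k.permute(0, 3, 1, 2).reshape(b * w, -1, h)
        k_w = k.permute(0, 2, 1, 3).reshape(b * h, -1, w)
        v_h = v.permute(0, 3, 1, 2).reshape(b * w, -1, h)
        v_w = v.permute(0, 2, 1, 3).reshape(b * h, -1, w)
        # Affinities along the column (H) and the row (W) of each position.
        e_h = (torch.bmm(q_h, k_h) + neg_inf_diag(b, h, w, x.device)).view(b, w, h, h).permute(0, 2, 1, 3)
        e_w = torch.bmm(q_w, k_w).view(b, h, w, w)
        # One softmax over the concatenated criss-cross path (H + W scores).
        attn = self.softmax(torch.cat([e_h, e_w], 3))
        a_h = attn[..., :h].permute(0, 2, 1, 3).reshape(b * w, h, h)
        a_w = attn[..., h:].reshape(b * h, w, w)
        out_h = torch.bmm(v_h, a_h.permute(0, 2, 1)).view(b, w, -1, h).permute(0, 2, 3, 1)
        out_w = torch.bmm(v_w, a_w.permute(0, 2, 1)).view(b, h, -1, w).permute(0, 2, 1, 3)
        return self.gamma * (out_h + out_w) + x
```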
A CUDA extension is not necessary. Previous Criss-Cross Attention projects use a CUDA extension for PyTorch. Here I design a more elegant pure PyTorch implementation of Criss-Cross Attention in CC.py. To check its correctness and compare it with the official CUDA cc_attention, run check.py.
To check correctness, I compare my pure PyTorch CC() with the official CUDA CrissCross(); the inputs are Query, Key and Value, respectively.
The theoretical output should be 3. The output of my CC() is exactly 3, but the output of the official CUDA CrissCross() is not exactly 3.
Then I check the gradient: the theoretical gradient of z is 1. The gradient of CC() is exactly 1, but the gradient of the CUDA CrissCross() is 0.9999998212. A sketch of such a check is given below.
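As a hypothetical illustration of this kind of check (the real one is check.py; the call signature of CC() is assumed here): if Value is a constant tensor of 3s, the softmax weights along each criss-cross path sum to 1, so every output element must be exactly 3, and any deviation exposes numerical error.

```python
import torch
from CC import CC  # pure-PyTorch criss-cross attention from this repo; call signature assumed

q = torch.randn(1, 8, 5, 5)   # Query
k = torch.randn(1, 8, 5, 5)   # Key
v = torch.full((1, 64, 5, 5), 3.0, requires_grad=True)  # constant Value of 3

out = CC()(q, k, v)   # each output is a convex combination of Value entries
print(out)            # expected: every entry exactly 3.0

out.sum().backward()  # then run the same inputs through the CUDA CrissCross()
print(v.grad)         # on a GPU and diff both the outputs and the gradients
```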
As for training and testing speed, I compare my PyTorch Criss-Cross Attention with the official CUDA Criss-Cross Attention in this project. For training with batch size 4 on four 2080Ti GPUs with OHEM, my PyTorch Criss-Cross Attention costs 14m32s on the Cityscapes training set, while the official CUDA Criss-Cross Attention costs 15m22s. For evaluation with batch size 1 on one 2080Ti at a single scale, mine costs 28m44s on the Cityscapes val set, while the official CUDA version costs 30m59s. A rough micro-benchmark sketch follows.
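For a quick, much rougher comparison than full training, one can micro-benchmark a single module's forward pass. The snippet below is a hypothetical sketch (it reuses the CrissCrossAttention class from the sketch above and assumes a CUDA GPU; the input size is arbitrary):

```python
import time
import torch

x = torch.randn(4, 64, 97, 97, device="cuda")
module = CrissCrossAttention(64).cuda()  # class from the sketch above

with torch.no_grad():
    for _ in range(10):           # warm-up iterations
        module(x)
    torch.cuda.synchronize()      # make sure timing brackets real GPU work
    t0 = time.time()
    for _ in range(100):
        module(x)
    torch.cuda.synchronize()
print(f"100 forward passes: {time.time() - t0:.3f}s")
```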
Evaluations of the same checkpoint at a single scale, using my pure PyTorch implementation and the official cc_attention:
- My module
- Official CUDA cc_attention
Our pure PyTorch implementation (CC.py) is faster, more precise, and more compatible.
For better compatibility across versions and environments, I decide to use a pure PyTorch implementation without the CUDA inplace-abn. I adopt Synchronized-BatchNorm-PyTorch instead, so it costs more GPU memory than inplace-abn; I will try to realize an efficient inplace-abn in the future. A usage sketch is shown below.
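As a sketch of what the drop-in replacement can look like (assuming the package layout of Synchronized-BatchNorm-PyTorch; the surrounding module is a made-up example, not this repo's exact network):

```python
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback  # assumed layout

# Use SynchronizedBatchNorm2d wherever nn.BatchNorm2d would appear, so batch
# statistics are reduced across all GPUs instead of computed per device.
head = nn.Sequential(
    nn.Conv2d(2048, 512, 3, padding=1, bias=False),
    SynchronizedBatchNorm2d(512),
    nn.ReLU(inplace=True),
)

# DataParallelWithCallback triggers the cross-GPU sync during forward, e.g.:
# model = DataParallelWithCallback(model, device_ids=[0, 1, 2, 3])
```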
A CUDA extension is not necessary.
If you do not want to compare this implementation with the CUDA version, you only need Python 3, PyTorch 1.2 or 0.4, OpenCV, and PIL.
If you want to compare this implementation with the CUDA version, you need PyTorch 1.1 or 1.2 and apex.
# Install PyTorch 1.1
$ conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
# Install Apex
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
The ImageNet pre-trained model can be downloaded from resnet101-imagenet.pth.
$ export CUDA_VISIBLE_DEVICES=0,1,2,3
$ python train.py --data-dir /data/datasets/Cityscapes/ --random-mirror --random-scale --restore-from ./dataset/resnet101-imagenet.pth --gpu 0,1,2,3 --learning-rate 0.01 --input-size 769,769 --weight-decay 0.0001 --batch-size 4 --num-steps 60000 --recurrence 2 --ohem 1 --ohem-thres 0.7 --ohem-keep 100000 --model ccnet
$ python evaluate.py --data-dir /data/datasets/Cityscapes/ --recurrence 2 --model ccnet --restore-from ./snapshots/CS_scenes_60000.pth --whole True --gpu 0 --batch-size 1
I train and evaluate this implementation on the Cityscapes dataset.
- CCNet: Criss-Cross Attention for Semantic Segmentation
- Synchronized-BatchNorm-PyTorch
- Implement an efficient inplace-abn in pure PyTorch.