Bencheng Liao1,2, Xinggang Wang2 📧, Lianghui Zhu2, Qian Zhang3, Chang Huang3
1 Institute of Artificial Intelligence, HUST, 2 School of EIC, HUST, 3 Horizon Robotics
(📧) corresponding author.
ArXiv Preprint (arXiv 2405.18425)
- **June 17th, 2024**: We release an initial version of ViG with code and weights.
- **May 29th, 2024**: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory.
However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into the 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks, outperforming popular Transformer and CNN-based models. Notably, ViG-S matches DeiT-B's accuracy while using only 27% of the parameters and 20% of the FLOPs, running 2$\times$ faster on 224$\times$224 images.
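The sketch below is a minimal, non-optimized PyTorch illustration of the bidirectional gated linear attention recurrence described above: a data-dependent gate decays a running key-value state, and forward and backward scans are summed so every token receives 1D global context from both directions. This is a hedged reading of the method, not the repo's fused hardware-aware Triton kernel, and it omits the 2D gating locality injection; all names and shapes are placeholders.

```python
# Minimal sketch of bidirectional gated linear attention (illustration only).
# The repo fuses the two scans into one Triton kernel; this loop is for clarity.
import torch


def gla_scan(q, k, v, gate):
    """Causal gated linear attention recurrence.

    q, k, gate: (B, L, Dk); v: (B, L, Dv); gate values lie in (0, 1).
    State update S_t = diag(gate_t) @ S_{t-1} + k_t v_t^T, output o_t = q_t @ S_t.
    """
    B, L, Dk = q.shape
    Dv = v.shape[-1]
    S = q.new_zeros(B, Dk, Dv)
    outs = []
    for t in range(L):
        S = gate[:, t].unsqueeze(-1) * S + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
        outs.append(torch.einsum("bd,bdv->bv", q[:, t], S))
    return torch.stack(outs, dim=1)  # (B, L, Dv)


def bidirectional_gla(q, k, v, gate_fwd, gate_bwd):
    """Direction-wise gating: separate gates for the forward and backward scans;
    the two outputs are summed to give each token bidirectional global context."""
    out_fwd = gla_scan(q, k, v, gate_fwd)
    out_bwd = gla_scan(q.flip(1), k.flip(1), v.flip(1), gate_bwd.flip(1)).flip(1)
    return out_fwd + out_bwd


if __name__ == "__main__":
    B, L, Dk, Dv = 2, 196, 64, 64  # e.g. 14x14 patch tokens
    q, k, v = (torch.randn(B, L, d) for d in (Dk, Dk, Dv))
    gf, gb = torch.sigmoid(torch.randn(B, L, Dk)), torch.sigmoid(torch.randn(B, L, Dk))
    print(bidirectional_gla(q, k, v, gf, gb).shape)  # torch.Size([2, 196, 64])
```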
git clone https://github.com/hustvl/ViG.git
cd ViG
conda create -n vig python=3.8
conda activate vig
# torch
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
# requirement
pip install -r requirement.txt
# GLA
pip install triton==2.2.0
cd flash-linear-attention
python setup.py develop
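Once the steps above finish, a quick import check (our suggestion, not part of the repo) can confirm that PyTorch, Triton, and the flash-linear-attention package are importable and that CUDA is visible; this assumes flash-linear-attention installs under the `fla` module name.

```python
# Suggested sanity check (not part of the repo): verify the core dependencies.
import torch
import triton
import fla  # assumed module name of the bundled flash-linear-attention package

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("triton:", triton.__version__)
print("fla:", getattr(fla, "__version__", "installed"))
```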
For single node:
cd classification
export CONFIG=configs/vig/vig-s.yaml
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=8 \
--master_addr="127.0.0.1" --master_port=29501 main.py \
--cfg ${CONFIG} --data-path data/IN1K/ \
--output /path/to/output
For two nodes:
cd classification
export CONFIG=configs/vig/vig-b.yaml
export MASTER_IP=XXXXXXX
# for first node
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
--master_addr=${MASTER_IP} --master_port=29501 main.py \
--cfg ${CONFIG} --data-path data/IN1K/ \
--output /path/to/output --batch-size 64
# for second node, modify the node_rank arg
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=8 \
--master_addr=${MASTER_IP} --master_port=29501 main.py \
--cfg ${CONFIG} --data-path data/IN1K/ \
--output /path/to/output --batch-size 64
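Both commands assume `--data-path` points at ImageNet-1K in the standard folder layout (a `train/` and a `val/` split, each with one sub-directory per class). That layout is our assumption based on common practice for this style of training script, so adjust if the repo expects something else. A quick check:

```python
# Hypothetical helper: verify that data/IN1K/ follows the standard ImageNet
# folder layout (train/ and val/, one sub-directory per class) assumed above.
from pathlib import Path

root = Path("data/IN1K")
for split in ("train", "val"):
    classes = [d for d in (root / split).iterdir() if d.is_dir()]
    images = sum(1 for d in classes for _ in d.iterdir())
    print(f"{split}: {len(classes)} classes, {images} images")
```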
Model | #param. | Top-1 Acc. (%) | Hugging Face Repo |
---|---|---|---|
ViG-T | 6M | 77.2 | https://huggingface.co/hustvl/ViG/tree/main |
ViG-S | 23M | 81.7 | https://huggingface.co/hustvl/ViG/tree/main |
ViG-B | 89M | 82.6 | https://huggingface.co/hustvl/ViG/tree/main |
Model | #param. | Top-1 Acc. (%) | Hugging Face Repo |
---|---|---|---|
ViG-H-T | 29M | 82.8 | https://huggingface.co/hustvl/ViG/tree/main |
ViG-H-S | 50M | 83.8 | https://huggingface.co/hustvl/ViG/tree/main |
ViG-H-B | 89M | 84.2 | https://huggingface.co/hustvl/ViG/tree/main |
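To use a released checkpoint locally, you can pull it from the Hugging Face repo listed above, for example with `huggingface_hub`. The filename `vig_s.pth` below is a placeholder of ours, so check the repo's file listing for the actual names.

```python
# Hedged example: fetch a ViG checkpoint from the Hugging Face repo and inspect it.
# "vig_s.pth" is a placeholder filename; adjust to the actual file in hustvl/ViG.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="hustvl/ViG", filename="vig_s.pth")
state = torch.load(ckpt_path, map_location="cpu")
print(sorted(state.keys())[:5] if isinstance(state, dict) else type(state))
```

The downloaded path can then be passed to `--pretrained` in the evaluation command below.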
To evaluate ViG-S on ImageNet-1K, run:
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=1 \
--master_port=29501 main.py \
--cfg configs/vig/vig-s.yaml \
--batch-size 128 --data-path ./data/IN1K/ \
--output ./output/ --pretrained /path/to/ckpt \
--eval
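If you want to reproduce a wall-clock speed comparison like the one quoted in the abstract, a simple timing loop such as the following can be used. It is a generic sketch in which `model` stands for any `torch.nn.Module` (e.g. a ViG model built by this repo), not the exact benchmarking protocol used in the paper.

```python
# Generic GPU throughput benchmark (a sketch; `model` is a placeholder for any
# torch.nn.Module, e.g. a ViG model constructed from this repo's configs).
import time
import torch


@torch.no_grad()
def benchmark(model, batch=128, size=224, warmup=20, iters=100):
    model.cuda().eval()
    x = torch.randn(batch, 3, size, size, device="cuda")
    for _ in range(warmup):          # warm up kernels / autotuning
        model(x)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    dt = time.time() - t0
    print(f"{iters * batch / dt:.1f} images/s")
```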
This code is developed on top of Vim, VMamba, VRWKV, and FLA. Thanks for their great work.
If you find ViG useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@article{vig,
title={ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention},
author={Bencheng Liao and Xinggang Wang and Lianghui Zhu and Qian Zhang and Chang Huang},
journal={arXiv preprint arXiv:2405.18425},
year={2024}
}