By Feng Li*, Hao Zhang*, Shilong Liu, Jian Guo, Lionel M. Ni, and Lei Zhang.
This repository is the official implementation of DN-DETR, accepted to CVPR 2022 (score 112, Oral presentation). Code is available now. [paper link] [Chinese explainer]
[2022/7] Code for DINO is available here!
[2022/6] We release a unified detection and segmentation model, Mask DINO, that achieves the best results on all three segmentation tasks (54.5 AP on COCO instance leaderboard, 59.4 PQ on COCO panoptic leaderboard, and 60.8 mIoU on ADE20K semantic leaderboard)! Code will be available here.
[2022/5] Our code is available! Better performance of 49.5 AP on COCO is achieved with ResNet-50.
[2022/4] Code is available for DAB-DETR here.
[2022/3] We built the repo Awesome Detection Transformer to collect papers on transformers for detection and segmentation. Your attention is welcome!
[2022/3] DN-DETR is selected for an Oral presentation at CVPR 2022.
[2022/3] We release another work, DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, which for the first time establishes a DETR-like model as a SOTA model on the leaderboard. It is also based on DN. Code will be available here.
- We present a novel denoising training method to speed up DETR training and offer a deepened understanding of the slow convergence issue of DETR-like methods.
- DN is only a training method and can be plugged into many DETR-like models or even traditional models to boost performance.
- DN-DETR achieves AP 43.4 and 48.6 with 12 and 50 epochs of training with a ResNet-50 backbone. Compared with the baseline models under the same setting, DN-DETR achieves comparable performance with 50% of the training epochs.
- Our optimized models result in better performance. DN-Deformable-DETR achieves 49.5 AP with a ResNet-50 backbone.
We build upon DAB-DETR and add a denoising part to accelerate training convergence. It adds only minimal computation and is removed at inference time. We conduct extensive experiments to validate the effectiveness of our denoising training, for example, the convergence curve comparison. You can refer to our paper for more experimental results.
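To make the denoising idea more concrete, here is a minimal sketch of how noised ground-truth queries could be produced during training. It is not the repository's implementation: the function names, the plain-Python `(cx, cy, w, h)` box format, and the default noise scales are assumptions for illustration, loosely following the box-shifting, box-scaling, and label-flipping noise described in the paper.

```python
import random


def _clamp01(value):
    """Keep a normalized coordinate inside [0, 1]."""
    return min(max(value, 0.0), 1.0)


def noise_box(box, shift_scale=0.4, size_scale=0.4, rng=random):
    """Return a noised copy of a normalized (cx, cy, w, h) box.

    The center moves by at most shift_scale * (w / 2, h / 2), so it stays
    inside the original box when shift_scale < 1. Width and height are
    rescaled by a factor in [1 - size_scale, 1 + size_scale].
    """
    cx, cy, w, h = box
    cx += rng.uniform(-1.0, 1.0) * shift_scale * w / 2
    cy += rng.uniform(-1.0, 1.0) * shift_scale * h / 2
    w *= 1.0 + rng.uniform(-1.0, 1.0) * size_scale
    h *= 1.0 + rng.uniform(-1.0, 1.0) * size_scale
    return tuple(_clamp01(v) for v in (cx, cy, w, h))


def flip_label(label, num_classes, flip_ratio=0.2, rng=random):
    """With probability flip_ratio, replace the label with a random class."""
    if rng.random() < flip_ratio:
        return rng.randrange(num_classes)
    return label


def make_denoising_queries(gt_boxes, gt_labels, num_classes, rng=random):
    """Build (noised_label, noised_box, target_index) triples.

    Each noised query is trained to reconstruct the ground-truth object it
    was derived from, so this part needs no bipartite matching.
    """
    return [
        (flip_label(label, num_classes, rng=rng), noise_box(box, rng=rng), index)
        for index, (box, label) in enumerate(zip(gt_boxes, gt_labels))
    ]
```

Because every noised query carries the index of the object it came from, its training target is known in advance, and the whole denoising branch can simply be skipped at inference time.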
We provide our models under DAB-DETR, DAB-Deformable-DETR (deformable encoder only), and DAB-Deformable-DETR (see the DAB-DETR code and paper for more details).
You can also refer to our model zoo on Baidu Netdisk (extraction code: niet).
| | name | backbone | box AP | Log/Config/Checkpoint | Where in Our Paper |
|---|---|---|---|---|---|
| 0 | DN-DETR-R50 | R50 | 44.7 (note 1) | Google Drive / BaiDu | Table 1 |
| 2 | DN-DETR-R50-DC5 | R50 | 46.3 | Google Drive / BaiDu | Table 1 |
| 5 | DN-DAB-Deformable-DETR (Deformable Encoder Only) (note 3) | R50 | 48.6 | Google Drive / BaiDu | Table 3 |
| 6 | DN-DAB-Deformable-DETR-R50-v2 (note 4) | R50 | 49.5 (48.4 in 24 epochs) | Google Drive / BaiDu | Optimized implementation with deformable attention in both encoder and decoder. See DAB-DETR for more details. |
| | name | backbone | box AP | Log/Config/Checkpoint | Where in Our Paper |
|---|---|---|---|---|---|
| 1 | DN-DAB-DETR-R50-DC5 (3 pat) (note 2) | R50 | 41.7 | Google Drive / BaiDu | Table 2 |
| 4 | DN-DAB-DETR-R101-DC5 (3 pat) (note 2) | R101 | 42.8 | Google Drive / BaiDu | Table 2 |
| 5 | DN-DAB-Deformable-DETR (Deformable Encoder Only) (note 3) | R50 | 43.4 | Google Drive / BaiDu | Table 2 |
| 5 | DN-DAB-Deformable-DETR (Deformable Encoder Only) (note 3) | R101 | 44.1 | Google Drive / BaiDu | Table 2 |
Notes:
- 1: The result is higher than the one reported in our paper (44.7 instead of 44.1) because we optimized the code. We did not rerun the other models, so you can expect better performance than the numbers reported in our paper.
- 2: The models marked (3 pat) are trained with multiple pattern embeddings (refer to Anchor DETR or DAB-DETR for more details).
- 3: This model is based on DAB-Deformable-DETR (Deformable Encoder Only), which is a multiscale version of DAB-DETR. It requires 16 GPUs to train because it uses deformable attention only in the encoder.
- 4: This model is based on DAB-Deformable-DETR, which is an optimized implementation with Deformable DETR. See DAB-DETR for more details. You are encouraged to use this deformable version because it uses deformable attention in both the encoder and the decoder, which makes it more lightweight (i.e., it trains with 4/8 A100 GPUs) and faster to converge (i.e., it achieves 48.4 in 24 epochs, comparable to the 50-epoch DAB-Deformable-DETR).
Our code largely follows DAB-DETR and adds components for denoising training, which are wrapped in the file dn_components.py. It contains three main functions: prepare_for_dn and dn_post_process (both used in your detection forward function to process the dn part), and compute_dn_loss (used to calculate the dn loss). You can import these functions and add them to your own detection model. You may also compare DN-DETR with DAB-DETR to see how these functions are added if you would like to use them in your own detection models.
You are also encouraged to apply it to some other DETR-like models or even traditional detection models and update results in this repo.
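The call order described above can be sketched as follows. The three `*_stub` functions are hypothetical stand-ins written for this illustration only: the real functions in dn_components.py operate on tensors and have different signatures, so treat this as a picture of the control flow rather than of the actual API.

```python
# Hypothetical stand-ins: names, arguments, and the scalar "queries" are
# assumptions. Only the order of the three calls mirrors the description.


def prepare_for_dn_stub(matching_queries, noised_gt_queries, training):
    """Prepend denoising queries during training; return queries and split index."""
    if not training:
        return list(matching_queries), 0
    return list(noised_gt_queries) + list(matching_queries), len(noised_gt_queries)


def dn_post_process_stub(outputs, num_dn):
    """Split decoder outputs into the denoising part and the matching part."""
    return outputs[:num_dn], outputs[num_dn:]


def compute_dn_loss_stub(dn_outputs, dn_targets):
    """Toy reconstruction loss: mean absolute error against known targets."""
    if not dn_outputs:
        return 0.0
    return sum(abs(o - t) for o, t in zip(dn_outputs, dn_targets)) / len(dn_outputs)


def forward(matching_queries, noised_gt_queries, dn_targets, decoder, training=True):
    """Run one forward pass; the denoising branch is active only in training."""
    queries, num_dn = prepare_for_dn_stub(matching_queries, noised_gt_queries, training)
    outputs = decoder(queries)
    dn_outputs, matching_outputs = dn_post_process_stub(outputs, num_dn)
    dn_loss = compute_dn_loss_stub(dn_outputs, dn_targets) if training else 0.0
    return matching_outputs, dn_loss
```

In this sketch the matching outputs are identical with and without the denoising branch, which reflects the point that the denoising part is dropped at inference time.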
We use the DAB-DETR project as our codebase, hence no extra dependency is needed for our DN-DETR. For the DN-Deformable-DETR, you need to compile the deformable attention operator manually.
We test our models under python=3.7.3, pytorch=1.9.0, cuda=11.1. Other versions may work as well.
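If you want to compare your environment with the tested versions before installing anything else, a small script along these lines may help. It is only a sketch: the `TESTED` dictionary and the function name `report_environment` are ours, and the script reports versions without deciding whether they are compatible.

```python
import importlib
import platform

# Versions quoted in this README as the tested configuration.
TESTED = {"python": "3.7.3", "torch": "1.9.0", "cuda": "11.1"}


def report_environment():
    """Collect installed versions; entries that cannot be found map to None."""
    found = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return found
    found["torch"] = torch.__version__
    found["cuda"] = torch.version.cuda  # None for CPU-only builds
    return found


if __name__ == "__main__":
    for name, installed in report_environment().items():
        print(f"{name}: installed={installed} tested={TESTED[name]}")
```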
- Clone this repo
git clone https://github.com/IDEA-opensource/DN-DETR.git
cd DN-DETR
- Install PyTorch and torchvision
Follow the instructions at https://pytorch.org/get-started/locally/.
# an example:
conda install -c pytorch pytorch torchvision
- Install other needed packages
pip install -r requirements.txt
- Compile the CUDA operators (needed for DN-Deformable-DETR)
cd models/dn_dab_deformable_detr/ops
python setup.py build install
# unit test (should see all checking is True)
python test.py
cd ../../..
Please download the COCO 2017 dataset and organize it as follows:
COCODIR/
  ├── train2017/
  ├── val2017/
  └── annotations/
        ├── instances_train2017.json
        └── instances_val2017.json
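Before launching a long job, it can be worth checking that the directory matches this layout. The helper below is a sketch and not part of the repository; the name `missing_coco_entries` is ours, and it only checks that the paths exist, not that the annotation files are valid.

```python
from pathlib import Path

# Paths taken from the layout shown above, relative to COCODIR.
REQUIRED = [
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
]


def missing_coco_entries(coco_dir):
    """Return the required paths that are absent under coco_dir."""
    root = Path(coco_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_coco_entries("/path/to/your/COCODIR")
    print("layout looks complete" if not missing else f"missing: {missing}")
```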
We use the standard DN-DETR-R50 and DN-Deformable-DETR-R50 as examples for training and evaluation.
Download our DN-DETR-R50 model checkpoint from this link and run the command below.
You can expect a final AP of about 44.7.
For our DN-DAB-Deformable-DETR_Deformable_Encoder_Only (download here), the expected final AP is 48.6.
For our DN-DAB-Deformable-DETR (download here), the expected final AP is 49.5.
# In all commands, replace /path/to/your/COCODIR with your COCO path
# and /path/to/our/checkpoint with your checkpoint path.
# for dn_detr: 44.1 AP; optimized result is 44.7 AP
python main.py -m dn_dab_detr \
--output_dir logs/dn_DABDETR/R50 \
--batch_size 1 \
--coco_path /path/to/your/COCODIR \
--resume /path/to/our/checkpoint \
--use_dn \
--eval
# for dn_dab_deformable_detr: 49.5 AP
python main.py -m dn_dab_deformable_detr \
--output_dir logs/dab_deformable_detr/R50 \
--batch_size 1 \
--coco_path /path/to/your/COCODIR \
--resume /path/to/our/checkpoint \
--transformer_activation relu \
--use_dn \
--eval
# for dn_dab_deformable_detr_deformable_encoder_only: 48.6 AP (uses 3 pattern embeddings)
python main.py -m dn_dab_deformable_detr_deformable_encoder_only \
--output_dir logs/dab_deformable_detr/R50 \
--batch_size 1 \
--coco_path /path/to/your/COCODIR \
--resume /path/to/our/checkpoint \
--transformer_activation relu \
--num_patterns 3 \
--use_dn \
--eval
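If you prefer to launch evaluation from Python, for example to loop over several checkpoints, the command can be assembled as an argument list, which avoids shell line-continuation pitfalls. This is a sketch only: the helper `build_eval_command` is ours, it uses just the flags shown in the commands above, and the paths are the same placeholders.

```python
import shlex


def build_eval_command(model, coco_path, checkpoint, output_dir, extra=()):
    """Assemble the evaluation command as a list of arguments."""
    return [
        "python", "main.py", "-m", model,
        "--output_dir", output_dir,
        "--batch_size", "1",
        "--coco_path", coco_path,
        "--resume", checkpoint,
        *extra,
        "--use_dn",
        "--eval",
    ]


cmd = build_eval_command(
    "dn_dab_detr",
    "/path/to/your/COCODIR",
    "/path/to/our/checkpoint",
    "logs/dn_DABDETR/R50",
)
# Print a copy-pasteable version; to execute, pass the list to
# subprocess.run(cmd, check=True) from the repository root.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Model-specific flags such as `--transformer_activation relu` can be passed through `extra`.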
Similarly, you can also train our model in a single process:
# for dn_detr (replace /path/to/your/COCODIR with your COCO path)
python main.py -m dn_dab_detr \
--output_dir logs/dn_DABDETR/R50 \
--batch_size 1 \
--epochs 50 \
--lr_drop 40 \
--coco_path /path/to/your/COCODIR \
--use_dn
However, as training is time-consuming, we suggest training the model on multiple devices.
If you plan to train the models on a cluster with Slurm, here is an example command for training:
# for dn_detr: 44.4-44.7 AP
python run_with_submitit.py \
--timeout 3000 \
--job_name DNDETR \
--coco_path /path/to/your/COCODIR \
-m dn_dab_detr \
--job_dir logs/dn_DABDETR/R50_%j \
--batch_size 2 \
--ngpus 8 \
--nodes 1 \
--epochs 50 \
--lr_drop 40 \
--use_dn
# for dn_dab_deformable_detr: 49.5 AP
python run_with_submitit.py \
--timeout 3000 \
--job_name dn_dab_deformable_detr \
--coco_path /path/to/your/COCODIR \
-m dn_dab_deformable_detr \
--transformer_activation relu \
--job_dir logs/dn_dab_deformable_detr/R50_%j \
--batch_size 2 \
--ngpus 8 \
--nodes 1 \
--epochs 50 \
--lr_drop 40 \
--use_dn
# for dn_dab_deformable_detr_deformable_encoder_only: 48.6 AP
python run_with_submitit.py \
--timeout 3000 \
--job_name dn_dab_deformable_detr_deformable_encoder_only \
--coco_path /path/to/your/COCODIR \
-m dn_dab_deformable_detr_deformable_encoder_only \
--transformer_activation relu \
--job_dir logs/dn_dab_deformable_detr/R50_%j \
--num_patterns 3 \
--batch_size 1 \
--ngpus 8 \
--nodes 2 \
--epochs 50 \
--lr_drop 40 \
--use_dn
If you want to train our DC version or multiple-patterns version, add
--dilation # for DC version
--num_patterns 3 # for 3 patterns
However, this requires additional training resources and memory, i.e., 16 GPUs.
The final AP should be similar to or better than ours, because our optimized results are better than the performance reported in the paper (for example, we report 44.1 for DN-DETR, but our new result reaches 44.7). Don't be surprised if you get a better result!
Our training setting is the same as DAB-DETR's, with one added argument, --use_dn. You may also refer to DAB-DETR.
Notes:
- The results are sensitive to the batch size. We use 16 (2 images per GPU x 8 GPUs) by default.
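The default of 16 is the product of the per-GPU batch size, the number of GPUs per node, and the number of nodes. The arithmetic can be sketched as below; the helper name `effective_batch_size` is ours and not part of the repository, and the numbers come from the Slurm commands in this README.

```python
def effective_batch_size(per_gpu, gpus_per_node, nodes=1):
    """Total number of images per optimizer step across all processes."""
    return per_gpu * gpus_per_node * nodes


# --batch_size 2, --ngpus 8, --nodes 1
print(effective_batch_size(2, 8))     # prints 16
# encoder-only model: --batch_size 1, --ngpus 8, --nodes 2
print(effective_batch_size(1, 8, 2))  # prints 16
```

Both multi-GPU configurations land on the same total of 16. If you train on a different number of GPUs, keeping this product at 16 is one way to stay close to the reported setting, memory permitting.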
Alternatively, run with multiple processes on a single node:
# for dn_dab_detr: 44.7 AP
python -m torch.distributed.launch --nproc_per_node=8 \
main.py -m dn_dab_detr \
--output_dir logs/dn_DABDETR/R50 \
--batch_size 2 \
--epochs 50 \
--lr_drop 40 \
--coco_path /path/to/your/COCODIR \
--use_dn
# for dn_dab_deformable_detr: 49.5 AP
python -m torch.distributed.launch --nproc_per_node=8 \
main.py -m dn_dab_deformable_detr \
--output_dir logs/dn_dab_deformable_detr/R50 \
--batch_size 2 \
--epochs 50 \
--lr_drop 40 \
--transformer_activation relu \
--coco_path /path/to/your/COCODIR \
--use_dn
Our work is based on DAB-DETR. We also release another SOTA detection model, DINO, based on DN-DETR and DAB-DETR.
- DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
  Hao Zhang*, Feng Li*, Shilong Liu*, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, Heung-Yeung Shum.
  arXiv 2022.
  [paper] [code].
- DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR.
  Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang.
  International Conference on Learning Representations (ICLR) 2022.
  [paper] [code].
DN-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.
Copyright (c) IDEA. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@inproceedings{li2022dn,
title={Dn-detr: Accelerate detr training by introducing query denoising},
author={Li, Feng and Zhang, Hao and Liu, Shilong and Guo, Jian and Ni, Lionel M and Zhang, Lei},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={13619--13627},
year={2022}
}