Dense object detection is widely used in autonomous driving, video surveillance, and other fields, and is the challenging task this work focuses on. Detection methods that rely on greedy post-processing such as Non-Maximum Suppression (NMS) often produce duplicate predictions or missed detections in dense scenes. Although end-to-end DETR (DEtection TRansformer) detectors can fold the de-duplication role of NMS-style post-processing into the network, we find that the homogeneous queries in query-based detectors weaken both the network's de-duplication ability and the encoder's learning efficiency, again leading to duplicate predictions and missed detections. To solve this problem, we propose a learnable differentiated encoding to dehomogenize the queries; at the same time, queries can communicate with each other through the differentiated encoding information, replacing the previous self-attention among queries. In addition, we apply a joint loss on the encoder output that considers both location and confidence prediction to give the queries a higher-quality initialization. Without cumbersome decoder stacking, and without sacrificing accuracy, our end-to-end detection framework is more concise and uses about 8% fewer parameters than Deformable DETR. Our method achieves excellent results on the challenging CrowdHuman dataset: 93.6% average precision (AP), 39.2% MR-2, and 84.3% JI, outperforming previous state-of-the-art methods such as Iter-E2EDet (Progressive End-to-End Object Detection) and MIP (One proposal, multiple predictions). In addition, our method is more robust across scenarios with different densities.
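As a rough illustration of the idea, the sketch below gives each query a learnable, query-specific code and lets queries exchange information through those codes instead of full self-attention among queries. This is a minimal conceptual PyTorch example, not the paper's implementation: the module name `DifferentiatedQueryEncoding`, the dimensions, and the way the codes are mixed into the queries are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DifferentiatedQueryEncoding(nn.Module):
    """Conceptual sketch (not the paper's code): attach a learnable,
    query-specific code to otherwise homogeneous queries so that they become
    distinguishable, and let queries communicate through these codes instead
    of self-attention among queries."""

    def __init__(self, num_queries: int, d_model: int):
        super().__init__()
        # One learnable differentiated code per query (illustrative).
        self.diff_code = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Lightweight mixing over the codes, standing in for query self-attention.
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (batch, num_queries, d_model), typically identical content queries.
        code = self.diff_code.unsqueeze(0)   # (1, num_queries, d_model)
        exchanged = self.mix(code)           # queries "communicate" via their codes
        return queries + code + exchanged    # dehomogenized queries


if __name__ == "__main__":
    enc = DifferentiatedQueryEncoding(num_queries=1000, d_model=256)
    q = torch.zeros(2, 1000, 256)            # identical (homogeneous) queries
    print(enc(q).shape)                      # torch.Size([2, 1000, 256])
```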
Results of different methods on CrowdHuman. All approaches use R-50 as the backbone.
Method | #queries | AP | MR-2 | JI | #Params |
---|---|---|---|---|---|
CrowdDet | -- | 90.7 | 41.4 | 82.4 | |
Sparse RCNN | 500 | 90.7 | 44.7 | 81.4 | |
Deformable DETR | 1000 | 91.3 | 43.8 | 83.3 | 37.7M |
Iter-E2EDet | 1000 | 92.1 | 41.5 | 84.0 | 38.0M |
Deformable DETR + Ours (6-3) | 1000 | 93.6 | 39.2 | 84.3 | 34.6M |
Deformable DETR + Ours (6-3 (2)) | 1000 | 93.5 | 39.3 | 84.1 | 33.7M |
X-Y (Z) denotes training with X encoder layers and Y decoder layers, and testing with Z decoder layers; other methods default to 6-6.
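The "(Z)" setting means a model trained with Y decoder layers can be evaluated with fewer of them. A minimal sketch of how that can work in a DETR-style decoder is shown below; it is a conceptual example, not this repository's code, and the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class TruncatableDecoder(nn.Module):
    """Conceptual sketch: a stack of decoder layers that can be cut short at
    inference time, e.g. trained with 3 layers but evaluated with only 2
    (the "6-3 (2)" row in the table above)."""

    def __init__(self, num_layers: int, d_model: int = 256, nhead: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, queries, memory, num_eval_layers=None):
        # Use every layer during training; optionally stop early at test time.
        n = num_eval_layers or len(self.layers)
        for layer in self.layers[:n]:
            queries = layer(queries, memory)
        return queries


if __name__ == "__main__":
    dec = TruncatableDecoder(num_layers=3)
    q, mem = torch.randn(2, 1000, 256), torch.randn(2, 5000, 256)
    print(dec(q, mem, num_eval_layers=2).shape)  # torch.Size([2, 1000, 256])
```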
The codebase is built on top of Deformable-DETR.
- Install and build libs following Deformable-DETR.
pip install -r requirements.txt
sh lib/ops/make.sh
- Load the CrowdHuman images from here and their annotations from here. Then update the directory path of the CrowdHuman dataset in config.py (see the hypothetical snippet after this list).
- Train Iter Deformable-DETR:
bash exps/aps.sh
or, for the Swin-L backbone:
wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22k.pth
bash exps/aps_swinl.sh
- Evaluate Iter Deformable-DETR. You can download the pre-trained model from here (Baidu Drive) or here (Google Drive) for direct evaluation.
bash exps/aps_test.sh
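The dataset-path step above refers to config.py. As a purely hypothetical illustration (the actual variable names in config.py may differ), updating the CrowdHuman paths could look like this:

```python
# Hypothetical snippet for config.py -- the real variable names may differ.
# Point these at wherever the CrowdHuman images and annotation files live.
CROWDHUMAN_ROOT = "/data/CrowdHuman"                             # images directory
CROWDHUMAN_TRAIN_ANN = "/data/CrowdHuman/annotation_train.odgt"  # training annotations
CROWDHUMAN_VAL_ANN = "/data/CrowdHuman/annotation_val.odgt"      # validation annotations
```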