Our team won first place in the WSDM Cup 2023 Toloka VQA Challenge.
For details, please see our technical report for the competition.
If you use this code in a paper, please cite:
@article{gao2023champion,
  title={Champion Solution for the WSDM2023 Toloka VQA Challenge},
  author={Gao, Shengyi and Chen, Zhe and Chen, Guo and Wang, Wenhai and Lu, Tong},
  journal={arXiv preprint arXiv:2301.09045},
  year={2023}
}
@article{chen2022vitadapter,
  title={Vision Transformer Adapter for Dense Predictions},
  author={Chen, Zhe and Duan, Yuchen and Wang, Wenhai and He, Junjun and Lu, Tong and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.08534},
  year={2022}
}
Install MMDetection v2.22.0.
# recommended environment: torch1.9 + cuda11.1
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install mmcv-full==1.4.2 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
pip install timm==0.4.12
pip install ftfy
pip install mmdet==2.22.0
ln -s ../detection/ops ./
cd ops && sh make.sh # compile deformable attention
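After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a minimal sketch, not part of this repository: the helper `check_versions` is hypothetical, the pins are copied from the commands above, and it assumes Python 3.8+ so that `importlib.metadata` is available.

```python
from importlib import metadata

# Version pins copied from the install commands above.
EXPECTED = {
    "torch": "1.9.0",
    "torchvision": "0.10.0",
    "mmcv-full": "1.4.2",
    "timm": "0.4.12",
    "mmdet": "2.22.0",
}


def check_versions(expected, get_version=metadata.version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Local build tags such as "+cu111" are ignored for the comparison.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

If the script prints nothing, every pinned package was found at the expected version.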
Prepare the Toloka VQA dataset and, optionally, the filtered GQA dataset.
Please download the converted annotations for wsdm2023 from here.
wsdm2023
└── data
    ├── wsdm2023
    │   ├── annotations
    │   ├── train
    │   ├── train_sample
    │   └── test_public
    └── grounding_gqa
        ├── annotations
        └── images
Name | Year | Type | Data | Repo | Paper |
---|---|---|---|---|---|
Uni-Perceiver | 2022 | Supervised | Multi-Modal | repo | paper |
Backbone | Pretrain | Head | Lr schd | Config | Checkpoint | Log |
---|---|---|---|---|---|---|
ViT-Adapter-B | UniPerceiver-B | DINO | 6ep | config | ckpt | log |
ViT-Adapter-L | UniPerceiver-L | DINO | 6ep | config | ckpt | log |
To pre-train the model on the filtered GQA dataset on a single node with 8 GPUs:
sh dist_train.sh configs/dino_4scale_uniperceiver_adapter_base_6ep_gqa.py 8
sh dist_train.sh configs/dino_4scale_uniperceiver_adapter_large_6ep_gqa.py 8
- We split a val set from the training set for offline model evaluation.
Backbone | Pretrain | Head | Lr schd | Split | Val | Public Test | Private Test | Config | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|---|
ViT-Adapter-B | UniPerceiver-B+GQA | DINO | 24ep | train | 74.2 | 74.2 | - | config | ckpt | log |
ViT-Adapter-L | UniPerceiver-L+GQA | DINO | 24ep | train | 76.7 | 76.9 | - | config | ckpt | log |
ViT-Adapter-L | UniPerceiver-L+GQA | DINO | 24ep | trainval | - | 77.5 | 76.347 | config | ckpt | log |
To train the model on the Toloka VQA dataset on a single node with 8 GPUs:
sh dist_train.sh configs/dino_4scale_uniperceiver_adapter_base_24ep_gqa_wsdm2023.py 8
sh dist_train.sh configs/dino_4scale_uniperceiver_adapter_large_24ep_gqa_wsdm2023.py 8
sh dist_train.sh configs/dino_4scale_uniperceiver_adapter_large_24ep_gqa_wsdm2023_trainval.py 8
To evaluate our model on the val set on a single node with 8 GPUs:
sh dist_test.sh /path/to/config /path/to/checkpoint 8 --eval bbox IoU
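For readers unfamiliar with the `IoU` metric named in the command above, the standard intersection-over-union between one predicted box and one ground-truth box can be sketched as follows. This is an illustration only: `box_iou` is a hypothetical helper, the `(left, top, right, bottom)` box format is an assumption, and the metric implementation used by the evaluation script may differ in details such as averaging.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, right, bottom)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Degenerate (zero-area) inputs yield 0.0 instead of a division error.
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 region, so their IoU is 1 / (4 + 4 - 1) = 1/7.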