This is the official implementation of the paper "Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation." In this paper, we propose a novel framework that leverages visual foundation models (VFMs) for domain generalizable semantic segmentation (DGSS). The core idea is to fine-tune the VFM with minimal modifications and enable inference on high-resolution images. We argue that this approach maintains the pretrained knowledge of the VFM and unleashes its power for DGSS. We conduct experiments on various benchmarks and achieve an average mIoU of 70.3% on GTAV to {Cityscapes + BDD100K + Mapillary} and 71.62% on Cityscapes to {BDD100K + Mapillary}, outperforming the previous state-of-the-art approaches by 3.3% and 1.1% in average mIoU, respectively.
- [Environment Setup](#environment-setup)
- [Dataset Preparation](#dataset-preparation)
- [Preparing Visual Foundation Models](#preparing-visual-foundation-models)
- [Training](#training)
- [Evaluation](#evaluation)
- [Overview of Important Files](#overview-of-important-files)
## Environment Setup

To set up the environment for this project, execute the following script:

```bash
chmod +x install.sh
./install.sh
```
This script creates a conda virtual environment named `DGVFM` and installs all required dependencies. Before running any code, activate the environment:

```bash
conda activate DGVFM
```
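After activation, a quick sanity check confirms the environment is functional (a minimal sketch; it only assumes PyTorch is among the dependencies installed by `install.sh`):

```python
# Minimal environment sanity check: verifies that PyTorch imports and a
# GPU is visible. Assumes install.sh installs PyTorch, which a
# segmentation codebase requires.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```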
## Dataset Preparation

1. Download the datasets:

   - GTA: Download all image and label packages from here and extract them to `data/gta`.
   - Cityscapes: Download `leftImg8bit_trainvaltest.zip` and `gt_trainvaltest.zip` from here and extract them to `data/cityscapes`.
   - BDD100K: Download the 10K Images and Segmentation from here and extract them to `data/bdd100k`.
   - Mapillary: Download MAPILLARY v1.2 from here and extract it to `data/mapillary`.
The final folder structure should look like this:

```
DGVFM
├── ...
├── data
│   ├── gta
│   │   ├── images
│   │   ├── labels
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── bdd100k
│   │   ├── images
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── labels
│   │   │   ├── train
│   │   │   ├── val
│   ├── mapillary
│   │   ├── training
│   │   │   ├── images
│   │   │   ├── labels
│   │   ├── validation
│   │   │   ├── images
│   │   │   ├── val_label
├── ...
```
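Before converting, you can verify that the extracted datasets match the layout above with a short check (a sketch that uses only the directory names shown in the tree):

```python
# Verify the dataset layout matches the tree above before running conversion.
import os

expected = [
    "data/gta/images", "data/gta/labels",
    "data/cityscapes/leftImg8bit", "data/cityscapes/gtFine",
    "data/bdd100k/images", "data/bdd100k/labels",
    "data/mapillary/training/images", "data/mapillary/validation/images",
]
for d in expected:
    print(f"{d}: {'ok' if os.path.isdir(d) else 'MISSING'}")
```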
2. Convert the datasets with these commands:

```bash
cd DGVFM
python tools/convert_datasets/gta.py data/gta
python tools/convert_datasets/cityscapes.py data/cityscapes
python tools/convert_datasets/mapillary2cityscape.py data/mapillary \
    data/mapillary/cityscapes_trainIdLabel --train_id
# You do not need to convert BDD100K; it is already in the correct format.
```
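To spot-check the conversion, you can load one of the generated label maps (a sketch; the `*_labelTrainIds.png` filename pattern is an assumption based on the usual MMSegmentation convention for Cityscapes-style converters):

```python
# Spot-check a converted Cityscapes label map: train IDs must be 0-18,
# with 255 marking ignored pixels. The *_labelTrainIds.png pattern is
# assumed from the common MMSegmentation convention.
import glob

import numpy as np
from PIL import Image

files = glob.glob("data/cityscapes/gtFine/**/*_labelTrainIds.png",
                  recursive=True)
label = np.array(Image.open(files[0]))
assert {int(v) for v in np.unique(label)} <= set(range(19)) | {255}
print(f"{len(files)} converted label maps found; sample IDs look valid")
```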
## Preparing Visual Foundation Models

- Download: Download the pre-trained weights of the VFMs and place them in the `checkpoints` directory without changing the file names. You only need to download the model you want to run:
| Model | Download Link | Filename | Size |
|---|---|---|---|
| DINOv2 | DINOv2-ViT-L/14 | dinov2_vitl14_pretrain.pth | 1.2GB |
| EVA02 | EVA02-ViT-L/14 | eva02_L_pt_m38m_p14to16.pt | 613MB |
| CLIP | CLIP-ViT-L/14 | ViT-L-14.pt | 890MB |
| SAM | SAM-ViT-H/16 | sam_vit_h_4b8939.pth | 2.4GB |
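Before converting, you can confirm that a downloaded checkpoint deserializes correctly (a sketch; the DINOv2 filename is taken from the table above, so substitute the file of whichever model you downloaded):

```python
# Confirm a downloaded VFM checkpoint loads before converting it.
# The DINOv2 filename below is from the table; substitute your model's file.
import torch

state = torch.load("checkpoints/dinov2_vitl14_pretrain.pth",
                   map_location="cpu")
print(f"{len(state)} entries; first key: {next(iter(state))}")
```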
- Convert: Convert the pre-trained weights for training or evaluation:

```bash
# convert DINOv2
python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted.pth
# convert EVA02
python tools/convert_models/convert_eva2_512x512.py checkpoints/eva02_L_pt_m38m_p14to16.pt checkpoints/eva02_L_converted.pth
# convert CLIP
python tools/convert_models/convert_clip.py checkpoints/ViT-L-14.pt checkpoints/CLIP-ViT-L_converted.pth
# convert SAM
python tools/convert_models/convert_sam.py checkpoints/sam_vit_h_4b8939.pth checkpoints/sam_vit_h_converted.pth
```
## Training

Start training on a single GPU:

```bash
python tools/train.py configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py
```

You can also run the provided script:

```bash
./train.sh
```
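If you want to inspect or tweak settings before launching, the config can be loaded programmatically (a sketch that assumes an MMSegmentation/MMEngine-style config system, which the `tools/` and `configs/*.py` layout suggests):

```python
# Load and skim the training config before launching a run. Assumes an
# MMEngine-style config system, an inference from the configs/*.py layout.
from mmengine.config import Config

cfg = Config.fromfile("configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py")
print(cfg.pretty_text[:1000])  # skim model/backbone/dataset settings
```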
## Evaluation

Run the evaluation:

```bash
python tools/test.py \
    configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py \
    <path_to_your_checkpoint> \
    --backbone checkpoints/dinov2_converted.pth
```

You can also run the provided script:

```bash
./test.sh
```
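For reference, the reported mIoU averages per-class intersection-over-union. Here is a self-contained sketch of that computation (illustrative only; it is not the evaluation code that `tools/test.py` runs):

```python
# Illustrative mIoU computation from a per-class confusion matrix, where
# confusion[i, j] counts pixels of true class i predicted as class j.
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    denom = np.maximum(tp + fp + fn, 1.0)  # avoid division by zero
    return float((tp / denom).mean())

# Toy 2-class example: perfect predictions give an mIoU of 1.0.
print(mean_iou(np.array([[10, 0], [0, 5]])))  # -> 1.0
```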
## Overview of Important Files

This section provides an overview of the code files related to the model architecture and design:

- `core/models/backbones`: This folder contains the encoder implementations of the VFMs, including `dino_v2.py`, `eva_02.py`, `sam_vit.py`, and `clip.py`. `lora_backbone.py` implements the LoRA-based fine-tuning algorithm (see the sketch after this list).
- `core/models/heads`: This folder contains the head implementations for our VFMNet and MGRNet. `Liner_head.py` implements the head for VFMNet; `VFMHead.py` implements the head for MGRNet.
- `core/segmentors/Ms_VFM_encoder_decoder.py`: This file implements our multi-scale training algorithm and the two-stage coarse-to-fine inference algorithm.
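To make the fine-tuning design concrete, here is a minimal sketch of the LoRA idea that `lora_backbone.py` builds on (illustrative only; it is not the repository's implementation): the pretrained weights stay frozen and only a low-rank residual is trained, which is what preserves the VFM's pretrained knowledge.

```python
# Minimal LoRA sketch: a frozen pretrained linear layer plus a trainable
# low-rank update. Illustrative only; not the code in lora_backbone.py.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained knowledge stays frozen
        # A is small-random, B is zero, so training starts from the
        # unmodified pretrained mapping.
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * (x A) B
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

# Example: wrap a 1024-dim projection, as found in a ViT-L block.
layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 16, 1024))
print(out.shape)  # torch.Size([2, 16, 1024])
```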