Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation

This is the official implementation of the paper "Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation." In this paper, we propose a novel framework that leverages visual foundation models (VFMs) for domain generalizable semantic segmentation (DGSS). The core idea is to fine-tune the VFM with minimal modifications and to enable inference on high-resolution images. We argue that this approach can maintain the pretrained knowledge of the VFM and unleash its power for DGSS. We conduct experiments on various benchmarks and achieve an average mIoU of 70.3% on GTAV to {Cityscapes + BDD100K + Mapillary} and 71.62% on Cityscapes to {BDD100K + Mapillary}, outperforming the previous state-of-the-art approaches by 3.3% and 1.1% in average mIoU, respectively.

(Figures: figure2.png, figure3.png)

Table of Contents

  • Environment Setup
  • Dataset Preparation
  • Preparing Visual Foundation Models
  • Training
  • Pretrained Models
  • Evaluation
  • Overview of Important Files
  • Citation
  • Acknowledgment

Environment Setup

To set up the environment for this project, execute the following script:

chmod +x install.sh
./install.sh

This script will create a conda virtual environment named DGVFM and install all the required dependencies. To run the code, you should activate the virtual environment using the following command:

conda activate DGVFM
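
After activation, a quick sanity check can confirm that PyTorch sees your GPU. This is a minimal sketch that only assumes PyTorch is among the dependencies installed by install.sh; the script name is hypothetical:

# sanity_check.py -- hypothetical helper, not part of the repository.
# Confirms that PyTorch is installed and that a CUDA device is visible.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))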

Dataset Preparation

1. Download the dataset

  • GTA: Download all image and label packages from here and extract them to data/gta.
  • Cityscapes: Download leftImg8bit_trainvaltest.zip and gtFine_trainvaltest.zip from here and extract them to data/cityscapes.
  • BDD100K: Download the 10K Images and Segmentation from here and extract them to data/bdd100k.
  • Mapillary: Download MAPILLARY v1.2 from here and extract it to data/mapillary.

The final folder structure should look like this:

DGVFM
├── ...
├── data
│   ├── gta
│   │   ├── images
│   │   ├── labels
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── bdd100k
│   │   ├── images
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── labels
│   │   │   ├── train
│   │   │   ├── val
│   ├── mapillary
│   │   ├── training
│   │   │   ├── images
│   │   │   ├── labels
│   │   ├── validation
│   │   │   ├── images
│   │   │   ├── val_label
├── ...

2. Convert the dataset

Prepare the datasets with the following commands:

cd DGVFM
python tools/convert_datasets/gta.py data/gta 
python tools/convert_datasets/cityscapes.py data/cityscapes
python tools/convert_datasets/mapillary2cityscape.py data/mapillary data/mapillary/cityscapes_trainIdLabel --train_id
# you do not need to convert BDD100K. It is already in the correct format.
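
As a quick check that extraction and conversion produced the expected folders, you can count files per directory. This is a minimal sketch based on the folder structure shown above; it is not part of the repository:

# check_data.py -- hypothetical helper, not part of the repository.
# Counts files under the prepared dataset folders to spot empty or missing directories.
from pathlib import Path

folders = [
    "data/gta/images",
    "data/gta/labels",
    "data/cityscapes/leftImg8bit/train",
    "data/cityscapes/gtFine/train",
    "data/bdd100k/images/train",
    "data/mapillary/training/images",
]

for folder in folders:
    path = Path(folder)
    count = sum(1 for p in path.rglob("*") if p.is_file()) if path.exists() else 0
    print(f"{folder}: {count} files")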

Preparing Visual Foundation Models

  • Download: Download pre-trained weights of VFMs and place them in the checkpoints directory without changing the file name. You only need to download one of the following models depending on which one you want to run:
Model     Download Link       Filename                       Size
DINOv2    DINOv2-ViT-L/14     dinov2_vitl14_pretrain.pth     1.2GB
EVA02     EVA02-ViT-L/14      eva02_L_pt_m38m_p14to16.pt     613MB
CLIP      CLIP-ViT-L/14       ViT-L-14.pt                    890MB
SAM       SAM-ViT-H/14        sam_vit_h_4b8939.pth           2.4GB
  • Convert: Convert pre-trained weights for training or evaluation.

    # convert DINOv2
    python tools/convert_models/convert_dinov2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/dinov2_converted.pth
    # convert EVA02
    python tools/convert_models/convert_eva2_512x512.py checkpoints/eva02_L_pt_m38m_p14to16.pt checkpoints/eva02_L_converted.pth
    # convert CLIP
    python tools/convert_models/convert_clip.py checkpoints/ViT-L-14.pt checkpoints/CLIP-ViT-L_converted.pth
    # convert SAM
    python tools/convert_models/convert_sam.py checkpoints/sam_vit_h_4b8939.pth checkpoints/sam_vit_h_converted.pth
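
Each conversion script adapts a publicly released checkpoint to the key names this codebase expects. Conceptually, the step amounts to loading a state dict, remapping its keys, and saving the result; the sketch below illustrates that idea with a made-up key mapping and is not the repository's actual conversion logic:

# Conceptual sketch of a checkpoint conversion (the key mapping here is invented).
import torch

state = torch.load("checkpoints/dinov2_vitl14_pretrain.pth", map_location="cpu")

converted = {}
for key, value in state.items():
    # Illustrative rename only -- the real convert_* scripts define their own rules.
    converted[key.replace("blocks.", "backbone.layers.")] = value

torch.save(converted, "checkpoints/dinov2_converted_example.pth")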

Training

Start training on a single GPU:

python tools/train.py configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py

You can also run the script:

./train.sh
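
Before launching a long run, it can help to inspect the resolved config. The sketch below assumes an MMEngine-style Python config, which the .py config layout suggests but the README does not state explicitly:

# Hypothetical: print the fully resolved training config (assumes MMEngine is installed).
from mmengine.config import Config

cfg = Config.fromfile("configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py")
print(cfg.pretty_text)  # the merged configuration the trainer would receive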

Pretrained Models

We provide the following pretrained models:

Setting                        Checkpoint
GTAV -> Citys + BDD + Map      gta.pth
Citys -> BDD + Map             citys.pth

Evaluation

Run the evaluation:

python tools/test.py \
  configs/dg/gta2citys/dg_lora_dinov2_ms_masked.py \
  <path_to_your_checkpoint> \
  --backbone checkpoints/dinov2_converted.pth

You can also run the script:

./test.sh
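
The test script reports mIoU. For reference, the sketch below shows how per-class IoU and mean IoU follow from a confusion matrix; it is a generic illustration, independent of the repository's evaluation pipeline:

# Generic mIoU computation from a confusion matrix (illustration only).
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    """confusion[i, j] = number of pixels with ground truth i predicted as j."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(iou.mean())

# Tiny three-class example.
conf_mat = np.array([[50, 2, 1],
                     [3, 40, 5],
                     [0, 4, 60]])
print(f"mIoU: {mean_iou(conf_mat):.4f}")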

Overview of Important Files

This section provides an overview of the code files related to the model architecture and design:

  • core/models/backbones: This folder contains the implementation of the VFM encoders, including dino_v2.py, eva_02.py, sam_vit.py, and clip.py. lora_backbone.py implements the LoRA-based fine-tuning algorithm (a generic LoRA sketch follows this list).

  • core/models/heads: This folder contains the implementation of the heads for our VFMNet and MGRNet. Liner_head.py implements the head of VFMNet; VFMHead.py implements the head of MGRNet.

  • core/segmentors/Ms_VFM_encoder_decoder.py: This file implements our multi-scale training algorithm and the two-stage coarse-to-fine inference algorithm.
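
To make the LoRA-based fine-tuning mentioned above concrete (the actual implementation lives in core/models/backbones/lora_backbone.py), the sketch below shows the standard low-rank adaptation of a frozen linear layer. It is a generic illustration of the technique, not the code used in this repository:

# Generic LoRA illustration: a frozen linear layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank residual: W x + scale * (B A) x
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
out = layer(torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 1024])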

Citation

If you find this code useful for your research, please consider citing our paper:

@inproceedings{anonymous2024unleashing,
  title={Unleashing the Power of Visual Foundation Models for Generalizable Semantic Segmentation},
  author={Peiyuan Tang and Xiaodong Zhang and Chunze Yang and Haoran Yuan and Jun Sun and Danfeng Shan and Zijiang James Yan},
  booktitle={The 39th Annual AAAI Conference on Artificial Intelligence},
  year={2024},
  url={https://openreview.net/forum?id=ZarQ2RfHxO}
}

Acknowledgment

Our implementation builds on several open-source repositories; we thank their authors for their contributions.
