This is the official PyTorch implementation of the following publication:
GAGS: Granularity-Aware 3D Feature Distillation for Gaussian Splatting
Yuning Peng, Haiping Wang, Yuan Liu, Chenglu Wen, Zhen Dong, Bisheng Yang
arXiv 2024
Full Paper | Webpage | Eval Dataset
TL;DR: GAGS learns a 3D Gaussian field associated with semantic features, which enables accurate open-vocabulary 3D visual grounding in the scene.
Abstract: 3D open-vocabulary scene understanding, which accurately perceives complex semantic properties of objects in space, has gained significant attention in recent years. In this paper, we propose GAGS, a framework that distills 2D CLIP features into 3D Gaussian splatting, enabling open-vocabulary queries for renderings from arbitrary viewpoints. The main challenge of distilling 2D features for 3D fields lies in the multiview inconsistency of extracted 2D features, which provides unstable supervision for the 3D feature field. GAGS addresses this challenge with two novel strategies. First, GAGS associates the prompt point density of SAM with the camera distances, which significantly improves the multiview consistency of segmentation results. Second, GAGS further decodes a granularity factor to guide the distillation process, and this granularity factor can be learned in an unsupervised manner to select only the multiview-consistent 2D features in the distillation process. Experimental results on two datasets demonstrate significant performance and stability improvements of GAGS in visual grounding and semantic segmentation, with an inference speed 2× faster than baseline methods.
The code has been tested on:
- Ubuntu 20.04
- CUDA 11.8
- Python 3.8.19
- PyTorch 2.1.0
- GeForce RTX 4090
The repository contains submodules, so please clone it with:
# SSH
git clone git@github.com:WHU-USI3DV/GAGS.git --recursive
or
# HTTPS
git clone https://github.com/WHU-USI3DV/GAGS.git --recursive
Our default, provided install method is based on Conda package and environment management:
conda env create --file environment.yml
conda activate GAGS
Then, download the SAM checkpoint from here and place it in the ckpts/ directory.
Our training process consists of two main steps:
- Segmentation and feature extraction using a pre-trained 3DGS scene and a set of posed images.
- Feature distillation with the geometry parameters frozen.
Our model accepts datasets in the COLMAP format. Place your dataset in the data/<dataset_name> directory, and ensure the following structure:
<dataset_name>
|---images
|   |---<image 0>
|   |---<image 1>
|   |---...
|---sparse
    |---0
        |---cameras.bin
        |---images.bin
        |---points3D.bin
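Before running the pipeline, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch and not part of this repository: the helper name `missing_colmap_entries` is hypothetical, and it only checks that the expected paths exist, not that the COLMAP files are valid.

```python
from pathlib import Path

# Relative paths expected by the COLMAP layout described above.
REQUIRED = (
    "images",
    "sparse/0/cameras.bin",
    "sparse/0/images.bin",
    "sparse/0/points3D.bin",
)


def missing_colmap_entries(dataset_dir):
    """Return the expected entries that are absent from dataset_dir.

    Hypothetical helper for illustration; an empty list means the layout looks complete.
    """
    root = Path(dataset_dir)
    missing = [rel for rel in REQUIRED if not (root / rel).exists()]
    images_dir = root / "images"
    # An empty images/ folder is also reported, since there would be nothing to process.
    if images_dir.is_dir() and not any(p.is_file() for p in images_dir.iterdir()):
        missing.append("images/<at least one image>")
    return missing
```

Calling `missing_colmap_entries("data/<dataset_name>")` with your own dataset name returns the list of missing entries, so an empty list suggests the folder is ready for the next step.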
Note: For GPUs with 24 GB of VRAM or less, we recommend keeping input images at or below 1080p to avoid out-of-memory errors when training on complex scenes. For the two datasets tested in the paper, we recommend the original resolution (~720p) for LERF-OVS and, for each Mip-NeRF360-OVS scene, the downsampled size closest to 1080p, which matches our default training and evaluation settings.
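One way to read "the downsampled size closest to 1080p" is sketched below. This is an illustration under stated assumptions rather than the repository's own logic: the function name is hypothetical, resolution is measured by image height, and the candidate factors (1, 2, 4, 8) are assumed to be the downsampled copies available for a scene.

```python
def closest_downsample_factor(height, target_height=1080, factors=(1, 2, 4, 8)):
    """Pick the factor whose downsampled height is closest to target_height.

    Hypothetical helper; on a tie, the smaller factor (higher resolution) wins
    because min() returns the first best candidate.
    """
    return min(factors, key=lambda f: abs(height / f - target_height))


# Illustrative heights only, not measured from the datasets.
for h in (720, 2075, 3286):
    print(h, "->", closest_downsample_factor(h))
```

Note that "closest" can still select a size slightly above 1080p; if memory is tight, choosing the next larger factor is the safer option.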
If your input dataset does not include image pose information (e.g., images you captured yourself), you can use the convert.py script to extract undistorted images and SfM information, provided that COLMAP is installed. The script can also resize images (requires ImageMagick).
python convert.py -s <dataset_name> [--resize] # If not resizing, ImageMagick is not needed
The input format for convert.py is as follows:
<dataset_name>
|---input
    |---<image 0>
    |---<image 1>
    |---...
If COLMAP and ImageMagick are not on your system PATH, you can specify their locations with the optional --colmap_executable and --magick_executable arguments. For more details, refer to the 3D Gaussian Splatting repository.
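If you call convert.py from another script, the options described above can be assembled programmatically. The sketch below is not part of the repository: `build_convert_command` is a hypothetical helper, the scene name and executable path are placeholders, and it uses only the flags named in this section (`-s`, `--resize`, `--colmap_executable`, `--magick_executable`).

```python
import shlex


def build_convert_command(dataset, resize=False,
                          colmap_executable=None, magick_executable=None):
    """Assemble the convert.py invocation as an argument list (hypothetical helper)."""
    cmd = ["python", "convert.py", "-s", str(dataset)]
    if resize:
        cmd.append("--resize")  # resizing requires ImageMagick
    if colmap_executable:
        cmd += ["--colmap_executable", str(colmap_executable)]
    if magick_executable:
        cmd += ["--magick_executable", str(magick_executable)]
    return cmd


# Placeholder scene name and COLMAP location, for illustration only.
cmd = build_convert_command("my_scene", resize=True,
                            colmap_executable="/opt/colmap/bin/colmap")
print(shlex.join(cmd))
# prints: python convert.py -s my_scene --resize --colmap_executable /opt/colmap/bin/colmap
```

To execute the command rather than print it, pass the list to `subprocess.run(cmd, check=True)` from the repository root.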
Additionally, place your pre-trained 3DGS scene in the output/<case_name> directory. We recommend using gsplat to accelerate the process.
<case_name>
|---point_cloud/iteration_30000/point_cloud.ply
|---cameras.json
|---cfg_args
|---chkpnt30000.pth
|---input.ply
Modify the corresponding parameters in GAS.sh according to the filenames in the data/ and output/ directories, then run:
sh GAS.sh
Once completed, the data/<dataset_name> directory is expected to have the following structure:
<dataset_name>
|---images
|   |---<image 0>
|   |---<image 1>
|   |---...
|---sparse
|   |---0
|       |---cameras.bin
|       |---images.bin
|       |---points3D.bin
|---depth_sample
|   |---<image 0>_depth_sample.npy
|   |---<image 1>_depth_sample.npy
|   |---...
|---language_features
    |---<image 0>_f.npy
    |---<image 0>_s.npy
    |---...
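The expected layout above can be checked per image before starting distillation. This is a minimal sketch, not part of the repository: `missing_preprocessing_outputs` is a hypothetical helper, and it assumes that `<image N>` in the output filenames is the image filename without its extension, which this section does not state explicitly.

```python
from pathlib import Path


def missing_preprocessing_outputs(dataset_dir):
    """List expected preprocessing outputs that do not exist yet (hypothetical helper)."""
    root = Path(dataset_dir)
    missing = []
    for image in sorted((root / "images").iterdir()):
        if not image.is_file():
            continue
        stem = image.stem  # assumption: outputs are keyed by the name without extension
        expected = (
            root / "depth_sample" / f"{stem}_depth_sample.npy",
            root / "language_features" / f"{stem}_f.npy",
            root / "language_features" / f"{stem}_s.npy",
        )
        missing.extend(p.relative_to(root).as_posix() for p in expected if not p.exists())
    return missing
```

An empty return value suggests every image has its depth sample and both language feature files; any listed path points to an image that may need to be reprocessed.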
Modify the corresponding parameters in GAD.sh, then run:
sh GAD.sh
# rendering RGB and depth
python render.py -s $PROJ_PATH/data/$DATA_NAME -m $PROJ_PATH/output/$CASE_NAME --render_mode "RGB+ED" --foundation_model "none"
# rendering language feature
python render.py -s $PROJ_PATH/data/$DATA_NAME -m $PROJ_PATH/output/$CASE_NAME --foundation_model "sam_clip" --feature_mode
For the LERF and Mip-NeRF-360 datasets, download our annotated GT labels from here and place them in the data/label directory. Then, modify the corresponding parameters and run:
sh eval.sh
Our evaluation code is based on LERF and LangSplat. Special thanks to these amazing open-source projects!
- Release training code
- Release evaluation code
- Release evaluation GT labels for the datasets tested in the paper
- Release preprocessed datasets and pretrained models
- Release text-query relevance map visualization scripts
This repository is still under construction. Please feel free to open issues or submit pull requests. We appreciate all contributions to this project.
@article{peng2024gags,
title={GAGS: Granularity-Aware 3D Feature Distillation for Gaussian Splatting},
author={Peng, Yuning and Wang, Haiping and Liu, Yuan and Wen, Chenglu and Dong, Zhen and Yang, Bisheng},
journal={arXiv preprint arXiv:2412.13654},
year={2024}
}
We sincerely thank the excellent open-source projects:
LangSplat is the first to integrate multi-level language features into 3D Gaussian representations, advancing multi-scale language understanding.
Feature 3DGS and gsplat developed accessible 3D Gaussian rendering frameworks, significantly simplifying the representation and rendering of 3D language features in scenes.