Pingchuan Ma* · Lennart Rietdorf* · Dmytro Kotovenko · Vincent Tao Hu · Björn Ommer
CompVis Group @ LMU Munich, MCML
* equal contribution
Describing images accurately through text is key to explainability. Vision-Language Models (VLMs) such as CLIP align images and texts in a shared space. Descriptions generated by Large Language Models (LLMs) can further improve their classification performance. However, it remains unclear whether these performance gains stem from true semantics or from semantic-agnostic ensembling effects, as questioned by several prior works. To address this, we propose an alternative evaluation scenario that isolates the discriminative power of descriptions, and we introduce a training-free method for selecting discriminative descriptions. This method improves classification accuracy across datasets by leveraging CLIP’s local label neighborhood, offering insights into description-based classification and explainability in VLMs. Figure 1 depicts this procedure.
This repository is the official implementation of the paper "Does VLM Classification Benefit from LLM Description Semantics?". It enables the evaluation of Vision-Language Model (VLM) classification accuracy across different datasets, leveraging the semantics of descriptions generated by Large Language Models (LLMs).
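For orientation, the following is a minimal sketch of how description-based zero-shot classification is commonly scored: each class receives the mean similarity between the image embedding and the embeddings of its descriptions, and the highest-scoring class is predicted. It is not the selection method from the paper and not code from this repository. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_by_descriptions(image_emb, class_to_desc_embs):
    """Score each class by the mean cosine similarity between the image and
    that class's description embeddings; return (best_class, all_scores)."""
    scores = {
        label: sum(cosine(image_emb, d) for d in desc_embs) / len(desc_embs)
        for label, desc_embs in class_to_desc_embs.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings, made up for illustration only (not CLIP outputs).
toy_descriptions = {
    "daisy": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "rose": [[0.0, 1.0, 0.2], [0.1, 0.9, 0.0]],
}
toy_image = [0.8, 0.3, 0.0]
predicted, class_scores = classify_by_descriptions(toy_image, toy_descriptions)
```

Averaging over several descriptions is what makes it hard to tell semantic gains from plain ensembling, which is the question the paper examines.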
Results were obtained using Ubuntu 22.04.5 LTS, CUDA 11.8, and Python 3.10.14.
Install the necessary dependencies manually via
conda create -n <choose_name> python=3.10.14
conda activate <choose_name>
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu118
pip install tqdm
pip install torchmetrics
pip install imagenetv2_pytorch
pip install git+https://github.com/modestyachts/ImageNetV2_pytorch
pip install pyyaml
pip install git+https://github.com/openai/CLIP.git
pip install requests
The resulting Python environment should then correspond to requirements.txt.
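To confirm that the install commands above produced the intended environment, a small check along the following lines may help. This is a sketch, not part of the repository: the `EXPECTED` table and the `check_environment` helper are hypothetical, and only the three pinned versions come from the commands above. Packages without a pin are checked for presence only.

```python
import sys
from importlib import metadata

# Pinned versions are copied from the pip commands above; None means
# "any installed version is accepted".
EXPECTED = {
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "torchaudio": "2.2.1",
    "tqdm": None,
    "torchmetrics": None,
    "pyyaml": None,
    "requests": None,
}


def check_environment(expected=EXPECTED):
    """Map each package name to 'ok', 'missing', or a version-mismatch note."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # CUDA wheels report versions such as '2.2.1+cu118', so the local
        # suffix after '+' is dropped before comparing.
        if wanted is None or found.split("+")[0] == wanted:
            report[name] = "ok"
        else:
            report[name] = f"found {found}, expected {wanted}"
    return report


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```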
The datasets supported by this implementation are:
- Flowers102
- DTD (Describable Textures Dataset)
- Places365
- EuroSAT
- Oxford Pets
- Food101
- CUB-200
- ImageNet
- ImageNet V2
Most of these datasets will be automatically downloaded as torchvision datasets and stored in ./datasets during the first run of main.py. Instructions for datasets that have to be installed manually can be found below.
The CUB-200 dataset requires downloading the dataset files first, e.g. from https://data.caltech.edu/records/65de6-vp158 via
wget https://data.caltech.edu/records/65de6-vp158/files/CUB_200_2011.tgz?download=1D
After that, create a directory ./datasets/cub_200 and unpack CUB_200_2011.tgz into it. The dataset is then ready for embedding.
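The unpacking step can also be done from Python. The sketch below assumes the downloaded archive was saved as CUB_200_2011.tgz in the working directory. The `unpack_cub` helper is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def unpack_cub(archive_path, target_dir="./datasets/cub_200"):
    """Extract the archive into target_dir and return its top-level entries."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())


if __name__ == "__main__":
    archive = Path("CUB_200_2011.tgz")  # assumed local file name
    if archive.exists():
        print(unpack_cub(archive))
    else:
        print(f"{archive} not found; download it first")
```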
Follow the instructions under the following link to download the ImageNet dataset files:
https://pytorch.org/vision/main/generated/torchvision.datasets.ImageNet.html
Save these files to ./datasets/ilsvrc. The dataset is then ready to be embedded and used by the main.py script.
ImageNetV2 is an additional test set for models trained on ImageNet.
This dataset requires the imagenetv2_pytorch package listed in the environment setup above.
The dataset files will be downloaded automatically.
Available description pools can be found under ./descriptions. DClip descriptions are taken from https://github.com/sachit-menon/classify_by_description_release.
The description pools supported by this implementation are:
- DClip
- Contrastive Llama
Assignments of selected descriptions will be saved as JSON files to ./saved_descriptions.
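The layout of the saved JSON files is not documented here, so the sketch below makes no assumption about their schema. It only lists the files in ./saved_descriptions and reports the top-level type and number of entries of each. The `summarize_saved_descriptions` helper is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def summarize_saved_descriptions(directory="./saved_descriptions"):
    """Return {file name: (top-level JSON type, number of entries)}."""
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
        # Scalars count as a single entry; dicts and lists report their length.
        size = len(data) if isinstance(data, (dict, list)) else 1
        summary[path.name] = (type(data).__name__, size)
    return summary


if __name__ == "__main__":
    for name, (kind, size) in summarize_saved_descriptions().items():
        print(f"{name}: {kind} with {size} entries")
```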
On the first run of main.py for a dataset, the images are embedded by CLIP's backbone before the description selection pipeline depicted in Figure 1 is executed. The image embeddings are stored in ./image_embeddings and reused, which speeds up subsequent runs of the script.
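The compute-once, reuse-later behavior described above follows a common caching pattern, sketched below. The file names and serialization format this repository uses for ./image_embeddings are not stated here, so `load_or_compute` is a hypothetical helper. pickle stands in for whatever tensor format is actually used.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the cached object if cache_path exists; otherwise call
    compute_fn(), store its result at cache_path, and return it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

With this pattern the expensive embedding step runs only when no cache file exists. Deleting the file forces a fresh computation, for example after switching the backbone.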
To run the whole pipeline as depicted in Figure 1, call the script main.py. As stated above, a dataset is downloaded and embedded the first time it is used. Run the script with the following options:
python main.py --dataset <DATASET_NAME> --pool <DESCRIPTION_POOL> --encoding_device <CUDA_ID_0> --calculation_device <CUDA_ID_1>
--dataset
Choose the dataset to evaluate. Available options are:
- flowers
- dtd
- eurosat
- places
- food
- pets
- cub
- ilsvrc
- imagenet_v2
Be aware that downloading and embedding the places dataset may take a long time.
Default: flowers
--pool
Select the description pool to use for the evaluation. Available options are:
- dclip
- con_llama
Default: dclip
--encoding_device and --calculation_device
Select the CUDA device IDs as integers: --encoding_device is used for encoding images and texts, --calculation_device for the evaluation.
Default: 0 and 1
--backbone
Select the OpenAI CLIP ViT backbone. Available options are:
- b32
- b16
- l14
- l14@336px
Default: b32
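The options above can be summarized as an argparse definition. This is a sketch reconstructed from the documented flags, choices, and defaults, not the parser in main.py. The `BACKBONES` mapping to OpenAI CLIP model names is an assumption about how the short names are resolved.

```python
import argparse

# Assumed mapping from the short option names to OpenAI CLIP model names;
# the repository may resolve them differently.
BACKBONES = {
    "b32": "ViT-B/32",
    "b16": "ViT-B/16",
    "l14": "ViT-L/14",
    "l14@336px": "ViT-L/14@336px",
}
DATASETS = ["flowers", "dtd", "eurosat", "places", "food",
            "pets", "cub", "ilsvrc", "imagenet_v2"]
POOLS = ["dclip", "con_llama"]


def build_parser():
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented main.py options")
    parser.add_argument("--dataset", choices=DATASETS, default="flowers")
    parser.add_argument("--pool", choices=POOLS, default="dclip")
    # Integer CUDA IDs: one device for encoding, one for evaluation.
    parser.add_argument("--encoding_device", type=int, default=0)
    parser.add_argument("--calculation_device", type=int, default=1)
    parser.add_argument("--backbone", choices=list(BACKBONES), default="b32")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.dataset, args.pool, BACKBONES[args.backbone])
```

Note that both device defaults together assume a machine with at least two GPUs. On a single-GPU machine, passing the same ID to both options is the likely workaround.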
Our evaluation demonstrates that the proposed method significantly outperforms baselines in the classname-free setup, minimizing artificial gains from the ensembling effect. Additionally, we show that these improvements transfer to the conventional evaluation setup, achieving competitive results with substantially fewer descriptions required, while offering better interpretability.
If you use this codebase or otherwise find our work valuable, please cite our paper:
@misc{ma2024does,
title={Does VLM Classification Benefit from LLM Description Semantics?},
author={Pingchuan Ma and Lennart Rietdorf and Dmytro Kotovenko and Vincent Tao Hu and Björn Ommer},
year={2024},
eprint={2412.11917},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
- [17.12.2024] add valid arXiv link and bibtex.
- [03.12.2024] supported all datasets and tested with the env specified.
- [27.11.2024] set up the repo.
- [TBD] Support ViT-L Backbone
- [TBD] PyTorch Dataloader for Image Embeddings