Official code for the ICCV'23 paper "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models". Authors: Vishaal Udandarao, Ankush Gupta and Samuel Albanie.
Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. We pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks: "SuS" and "TIP-X", that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines.
All our code was tested on Python 3.6.8 with PyTorch 1.9.0+cu111. Ideally, our scripts require access to a single GPU (they call `.cuda()` for inference). Inference can also be run on CPUs with minimal changes to the scripts.
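If you want to run on CPU, the usual PyTorch pattern is to select a device once and move the model and each batch to it. A minimal sketch (the `get_device` helper below is ours for illustration, not part of the repo's scripts):

```python
import torch

def get_device():
    """Return a CUDA device when one is available, else fall back to CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Usage: replace calls like `model.cuda()` / `batch.cuda()` with
#   device = get_device()
#   model = model.to(device)
#   batch = batch.to(device)
```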
We recommend setting up a Python virtual environment and installing all the requirements. Please follow these steps to set up the project folder correctly:
```bash
git clone https://github.com/vishaal27/SuS-X.git
cd SuS-X
python3 -m venv ./env
source env/bin/activate
pip install -r requirements.txt
```
We provide detailed instructions on how to set up our datasets in `data/DATA.md`.
After setting up the datasets and the environment, the project root folder should look like this:
```
SuS-X/
|–– data
|–––– ucf101
|–––– ... 18 other datasets
|–– features
|–– gpt3_prompts
|–––– CuPL_prompts_ucf101.json
|–––– ... 18 other dataset json files
|–– README.md
|–– clip.py
|–– ... all other provided python scripts
```
You can run zero-shot CLIP inference using:
```bash
python run_zs_baseline.py --dataset <dataset> --backbone <CLIP_visual_backbone>
```
The `backbone` parameter can be one of [`RN50`, `RN101`, `ViT-B/32`, `ViT-B/16`].
You can run our re-implementation of the CALIP baseline using:
```bash
python run_calip_baseline.py --dataset <dataset> --backbone <CLIP_visual_backbone>
```
You can run the CuPL and CuPL+e baselines using:
```bash
python run_cupl_baseline.py --dataset <dataset> --backbone <CLIP_visual_backbone>
```
This script will also save the CuPL and CuPL+e text classifier weights into `features/`.
We provide scripts for both SuS-SD generation and SuS-LC retrieval.
The prompts used for the Photo prompting strategy can be found in `utils/prompts_helper.py`.
To generate customised CuPL prompts using GPT-3, you need access to an OpenAI API key. Please create an account on OpenAI and find your key under the API keys tab. Please ensure that the key is in the format `sk-xxxxxxxxx`.
You can then run the following command to generate CuPL prompts for any dataset:
```bash
python generate_gpt3_prompts.py --dataset <dataset> --openai_key <openai_key>
```
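For intuition, CuPL-style generation queries an LLM with natural-language questions about each category and collects the answers as prompts. The sketch below is illustrative only: the exact query templates and decoding parameters used by `generate_gpt3_prompts.py` may differ, and the commented `openai` call assumes the legacy GPT-3 completion API.

```python
def build_cupl_queries(category):
    """Illustrative LLM queries asking GPT-3 to describe a category."""
    return [
        "Describe what a photo of a {} looks like.".format(category),
        "What are the distinguishing visual features of a {}?".format(category),
    ]

# The generation step would then loop over these queries, e.g. (legacy API):
# import openai
# openai.api_key = "sk-xxxxxxxxx"
# completion = openai.Completion.create(
#     model="text-davinci-002", prompt=query, max_tokens=50, n=10)
```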
To ensure reproducibility, we provide all 19 dataset CuPL prompt files generated by us (and used for SuS generation and CuPL inference) in `gpt3_prompts/`.
To generate images using the Stable-Diffusion v1-4 checkpoint, you need a Hugging Face token. Please create an account on Hugging Face and find your token under the access tokens tab. Please ensure that the token is in the format `hf_xxxxxxxxx`.
You can then generate the support set images using the command:
```bash
python generate_sd_sus.py --dataset <dataset> --num_images <number_of_images_per_class> --prompt_shorthand <prompting_strategy> --huggingface_key <huggingface_token>
```
`<prompting_strategy>` is `photo` for the Photo strategy and `cupl` for the CuPL strategy (refer to Sec. 3.1 of the paper for more details). The generated support set is saved in `data/sus-sd/<dataset>/<prompting_strategy>`.
There are two steps for correctly creating the SuS-LC support sets:
- Downloading the URLs of the top-ranked images from LAION-5B. You can download the URLs for the images in the support set using:
```bash
python retrieve_urls_lc.py --dataset <dataset> --num_images <number_of_image_urls_per_class> --prompt_shorthand <prompting_strategy>
```
The downloaded URLs are saved in `data/sus-lc/download_urls/<dataset>/<prompting_strategy>`.
- Downloading the top-ranked images using the downloaded URLs. You can download the support set images using:
```bash
python retrieve_images_lc_sus.py --dataset <dataset> --num_images <number_of_images_per_class> --prompt_shorthand <prompting_strategy>
```
The generated support set is saved in `data/sus-lc/<dataset>/<prompting_strategy>`.
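For reference, the second step amounts to reading each class's URL list and fetching the images one by one. A minimal stdlib sketch (the `download_support_set` helper and its file layout are ours for illustration, not the repo's exact implementation):

```python
import os
import urllib.request

def download_support_set(url_file, out_dir):
    """Download every image URL listed (one per line) in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    saved = []
    for i, url in enumerate(urls):
        path = os.path.join(out_dir, "{:05d}.jpg".format(i))
        try:
            urllib.request.urlretrieve(url, path)
            saved.append(path)
        except Exception:
            # LAION URLs frequently go stale; skip broken links.
            continue
    return saved
```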
You can create the test and validation image features using:
```bash
python encode_datasets.py --dataset <dataset>
```
This script will save the test, validation and few-shot features in `features/`.
You can create the curated SuS features using:
```bash
# for SuS-LC
python encode_sus_lc.py --dataset <dataset> --prompt_shorthand <prompting_strategy>
# for SuS-SD
python encode_sus_sd.py --dataset <dataset> --prompt_shorthand <prompting_strategy>
```
These scripts will save the respective SuS image features in `features/`.
You can create the different text classifier weights using:
```bash
python generate_text_classifier_weights.py --dataset <dataset>
```
This script will again save all the text classifier weights in `features/`.
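For intuition, CLIP-style zero-shot classifier weights are typically built by encoding several prompts per class, averaging the text embeddings, and L2-normalising. A schematic sketch of that recipe (here `encode_text` is a stand-in for CLIP's text encoder, not a function from this repo):

```python
import numpy as np

def build_classifier_weights(class_prompts, encode_text):
    """Average the normalised embeddings of each class's prompts.

    class_prompts: list (one entry per class) of lists of prompt strings.
    encode_text:   callable mapping a prompt string to a 1-D embedding.
    Returns an array of shape (num_classes, dim) with unit-norm rows.
    """
    weights = []
    for prompts in class_prompts:
        embs = np.stack([encode_text(p) for p in prompts])
        embs = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
        mean = embs.mean(axis=0)
        weights.append(mean / np.linalg.norm(mean))
    return np.stack(weights)
```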
To ensure reproducibility, we release the features used for all our baselines and our best-performing SuS-X-LC-P model here. We further provide detailed descriptions of the naming of the feature files in `features/FEATURES.md`.
Once you have correctly saved all the feature files, you can run TIP-X using:
```bash
python tipx.py --dataset <dataset> --backbone <CLIP_visual_backbone> --prompt_shorthand <prompting_strategy> --sus_type <SuS_type>
```
The `sus_type` parameter is `lc` for SuS-LC and `sd` for SuS-SD.
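For intuition, the cache-model idea underlying this step can be sketched in a few lines. Note this is a schematic TIP-Adapter-style combination, not the exact TIP-X formulation (TIP-X additionally uses KL-divergence-based image-to-support affinities; see Sec. 3.2 of the paper), and `alpha`/`beta` are illustrative hyperparameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cache_model_logits(test_feats, sus_feats, sus_labels, clip_logits,
                       alpha=1.0, beta=5.5):
    """Blend zero-shot CLIP logits with logits from a support-set cache.

    test_feats: (N_test, dim) unit-norm test image features.
    sus_feats:  (N_sus, dim) unit-norm support-set (SuS) image features.
    sus_labels: (N_sus, C) one-hot labels of the support images.
    clip_logits: (N_test, C) zero-shot CLIP logits.
    """
    affinity = test_feats @ sus_feats.T                    # (N_test, N_sus)
    cache_logits = np.exp(-beta * (1.0 - affinity)) @ sus_labels
    return softmax(clip_logits + alpha * cache_logits)
```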
If you found this work useful, please consider citing it as:
```bibtex
@inproceedings{udandarao2022sus-x,
    title={SuS-X: Training-Free Name-Only Transfer of Vision-Language Models},
    author={Udandarao, Vishaal and Gupta, Ankush and Albanie, Samuel},
    booktitle={ICCV},
    year={2023}
}
```
We build on several previous well-maintained repositories like CLIP, CoOp, CLIP-Adapter, TIP-Adapter and CuPL. We thank the authors for providing such amazing code, and enabling further research towards better vision-language model adaptation. We also thank the authors of the amazing Stable-Diffusion and LAION-5B projects, both of which are pivotal components of our method.
Please feel free to open an issue or email us at [email protected].