This repository contains the code for the paper Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions (CVPR 2024).
- 2024.06.22: We have uploaded the ChatGPT-annotated caption triplets of RefCOCO/+/g to Hugging Face. The same repository also includes the LoRA checkpoints for CLIP and FLAVA.
- Create a new conda environment:

  ```
  conda create --name vg python=3.8
  conda activate vg
  ```
- Install PyTorch following the official website. We have successfully tested our code on PyTorch 2.2.1 (`conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`), but lower versions should also work.
- Install the pip requirements via:

  ```
  pip install -r requirements.txt
  ```
We use the same pre-processed data for RefCOCO/g/+ as ReCLIP for a fair comparison.
Download the pre-processed data files via `gsutil cp gs://reclip-sanjays/reclip_data.tar.gz .` and extract them using `tar -xvzf reclip_data.tar.gz`. This data does not include images, so you also need to download the COCO train2014 images for RefCOCO/g/+ from http://images.cocodataset.org/zips/train2014.zip.
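Putting these steps together, a minimal download sketch might look like the following (the local directory `COCO2014/` is an assumption chosen to match the `--image_root` example later in this README; adjust the paths to your setup):

```bash
# Pre-processed ReCLIP annotations and detection files
gsutil cp gs://reclip-sanjays/reclip_data.tar.gz .
tar -xvzf reclip_data.tar.gz

# COCO train2014 images used by RefCOCO/+/g
wget http://images.cocodataset.org/zips/train2014.zip
unzip train2014.zip -d COCO2014/
```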
We release the caption triplet files on Hugging Face.
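If you prefer the command line, the release can be fetched with `huggingface-cli` (a sketch; `<HF_REPO_ID>` is a placeholder for the Hugging Face repository linked above):

```bash
# Download the released triplet files and LoRA checkpoints
# (add --repo-type dataset if the release is hosted as a dataset repository)
huggingface-cli download <HF_REPO_ID> --local-dir hf_release/
```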
For CLIP/FLAVA fine-tuning, you need to download the following datasets:
Please follow the instructions provided on each website to download the raw data.
After downloading the datasets, modify the configuration file `VLA_finetune/training/data_path.ini` so that it points to the root directories of your downloaded data. The `triplets_file` required for SWiG can be found on Hugging Face (`SWiG_triplets.json`).
The following format can be used to run experiments (a concrete example invocation is given after the argument descriptions below):

```
python eval_refcoco/main.py --input_file [INPUT_FILE] --image_root [IMAGE_ROOT] --method [parse/baseline/random/matching] --clip_model ViT-B/32 --triplets_file [TRIPLETS_FILE] --detector_file [DETECTION_FILE] {--rule_filter} {--enable_lora} {--lora_path [LORA_PATH]}
```
(`/` is used above to separate the different options for a given argument. Content enclosed in brackets `[]` should be replaced with actual values, while braces `{}` denote optional arguments.)
- `--input_file`: should be in `.jsonl` format, e.g., `reclip_preprocess/refcocog_val.jsonl` (we provide these files for the same datasets as ReCLIP; see the Data Download information above).
- `--image_root`: the top-level directory containing all images in the dataset, e.g., `COCO2014/train2014/`.
- `--detector_file`: if not specified, ground-truth proposals are used. The detection files are included in `reclip_data.tar.gz` and follow the naming pattern `{refcoco/refcocog/refcoco+}_dets_dict.json`, e.g., `reclip_preprocess/refcocog_dets_dict.json`.
- `--triplets_file`: caption triplets generated using ChatGPT, e.g., `triplets/gpt_refcocog_val.jsonl`.
- `--rule_filter`: whether to filter out redundant triplets based on heuristic rules. It should be enabled by default.
- `--enable_lora`: whether to load pre-trained LoRA modules for CLIP and FLAVA.
- `--lora_path`: if `--enable_lora` is set, specify the path to the pre-trained LoRA module, e.g., `pre_trained/epoch_latest.pt`.
- Choices for `method`: `matching` stands for our proposed triplet-to-instance visual grounding pipeline; `parse` and `baseline` are the two grounding methods used in the baseline paper ReCLIP; `random` selects one of the proposals uniformly at random.
- Choices for `clip_model`: we use `ViT-B/32` for all experiments.

For explanations of other arguments, see the `eval_refcoco/main.py` file.
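For example, an evaluation run on the RefCOCOg validation split with our matching method and detected proposals could look like the sketch below (the paths simply reuse the example values listed above and may differ in your setup):

```bash
python eval_refcoco/main.py \
    --input_file reclip_preprocess/refcocog_val.jsonl \
    --image_root COCO2014/train2014/ \
    --method matching \
    --clip_model ViT-B/32 \
    --triplets_file triplets/gpt_refcocog_val.jsonl \
    --detector_file reclip_preprocess/refcocog_dets_dict.json \
    --rule_filter
```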
- Install Gentopia by running the following command:

  ```
  pip install gentopia
  ```

- Modify the `ROOT_PATH` and `DATA_BASE_PATH` variables in the `triplets_chatgpt/scripts/run.sh` file. Ensure that the paths point to the correct directories. (For `reclip_preprocess`, see the Data Download information above.)
- Make sure to add your OpenAI key to the `triplets_chatgpt/.env` file for authentication.
- After making the necessary modifications, run the bash file `run.sh` to generate the caption triplets. A sketch of these steps is given below.
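A minimal sketch of these steps, assuming the `.env` file expects a standard `OPENAI_API_KEY` entry (the variable name and the example paths are assumptions; follow whatever `run.sh` and the `.env` file actually use):

```bash
# Edit these variables inside triplets_chatgpt/scripts/run.sh:
#   ROOT_PATH=/path/to/this/repository
#   DATA_BASE_PATH=/path/to/reclip_preprocess

# Store the OpenAI key (the variable name here is an assumption)
echo "OPENAI_API_KEY=<your-key>" > triplets_chatgpt/.env

# Generate the caption triplets
bash triplets_chatgpt/scripts/run.sh
```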
Make sure the data is ready for VLA fine-tuning.
The following format can be used to fine-tune CLIP:

```
torchrun --nproc_per_node 8 --master_port 23450 -m VLA_finetune.training.main --name CLIP_finetune --lora 4 --pretrained openai --epochs 20 --warmup 150 --workers 48 --lr 0.000005 --save-frequency 5 --batch-size 128 --model ViT-B/32
```
The following format can be used to fine-tune FLAVA:

```
torchrun --nproc_per_node 8 --master_port 23452 -m VLA_finetune.training.main --name FLAVA_finetune --lora 16 --epochs 20 --warmup 150 --workers 48 --lr 0.000005 --save-frequency 5 --batch-size 128 --flava
```
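Once fine-tuning has produced a checkpoint, it can be loaded at evaluation time with `--enable_lora` and `--lora_path` (a sketch re-using the example invocation above; `pre_trained/epoch_latest.pt` follows the example path given earlier and should be replaced with wherever your run actually saved its checkpoint):

```bash
python eval_refcoco/main.py \
    --input_file reclip_preprocess/refcocog_val.jsonl \
    --image_root COCO2014/train2014/ \
    --method matching \
    --clip_model ViT-B/32 \
    --triplets_file triplets/gpt_refcocog_val.jsonl \
    --detector_file reclip_preprocess/refcocog_dets_dict.json \
    --rule_filter \
    --enable_lora \
    --lora_path pre_trained/epoch_latest.pt
```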
- Upload code for Who's Waldo dataset
The code in the `eval_refcoco` directory is adapted from the baseline method ReCLIP to ensure a fair comparison. We have removed code pertaining to other comparison methods originally present in the ReCLIP repo to improve readability. The code in the `VLA_finetune` directory is adapted from TSVLC.
If you find this repository useful, please cite our paper:

```
@article{han2023zero,
  title={Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions},
  author={Han, Zeyu and Zhu, Fangrui and Lao, Qianru and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2311.17048},
  year={2023}
}
```