RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

📣 News

[2024/06/25]:✨The training code and pretrained weights of RWKV-CLIP have been released.
[2024/06/11]:✨The RWKV-CLIP paper has been submitted to arXiv.

💡 Highlights

We introduce a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. The detection tags introduce additional semantic information from the images, which in turn further constrains the LLMs and mitigates hallucinations.

We propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs.
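
As a rough illustration of the first highlight, the three text sources can be merged into a single LLM prompt, with the detection tags acting as a constraint on what the model may describe. The sketch below is not the prompt used in this project: `build_prompt`, its parameter names, and the instruction wording are all hypothetical.

```python
def build_prompt(raw_text: str, synthetic_caption: str, detection_tags: list[str]) -> str:
    """Assemble one LLM prompt from the three text sources (hypothetical format)."""
    # Normalize tags and drop duplicates and empty entries, keeping their order.
    unique_tags = list(
        dict.fromkeys(tag.strip().lower() for tag in detection_tags if tag.strip())
    )
    return (
        "Write one accurate description of the image using only the information below.\n"
        f"Web text: {raw_text.strip()}\n"
        f"Synthetic caption: {synthetic_caption.strip()}\n"
        f"Detection tags: {', '.join(unique_tags)}\n"
        # Hypothetical constraint: the tags bound which objects may be mentioned.
        "Do not mention objects that are absent from the detection tags."
    )


if __name__ == "__main__":
    print(build_prompt("my dog at the park", "a dog running on grass", ["Dog", "grass", "dog"]))
```

Listing the tags explicitly gives the LLM a fixed vocabulary of objects to draw on, which is one plausible way the tags could limit hallucinated content.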

🎨 In-Progress

Release training code
Release pretrained model weights
Release 70k Instruction Dataset
Release the generated diverse descriptions of YFCC15M

Environment installation

conda create -n rwkv_clip python=3.10 -y
conda activate rwkv_clip

pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -U openmim
mim install mmcv-full==1.7.2
pip install -r requirements.txt
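
After installation, it can help to confirm that the pinned versions are the ones actually present in the environment. The following is a minimal sketch using only the Python standard library; `check_versions` and `EXPECTED` are hypothetical helpers and are not part of this repository.

```python
from importlib import metadata

# Versions pinned by the installation commands above.
EXPECTED = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "torchaudio": "2.1.2",
    "mmcv-full": "1.7.2",
}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Local build tags such as '+cu118' are ignored when comparing.
        report[name] = "ok" if installed.split("+")[0] == wanted else f"found {installed}"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```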

Instruction Dataset

The 70K instruction dataset used to fine-tune LLaMA3 can be downloaded from Google Drive or BaiduYun.

Download YFCC15M

We use the YFCC15M-DeCLIP version of YFCC15M. We downloaded it from the repo and successfully obtained 15,061,515 image-text pairs.

Generate rec files

To improve training efficiency, we use MXNet to save the YFCC15M dataset as rec files and use NVIDIA DALI to accelerate data loading and pre-processing. The sample code for generating rec files is in data2rec.py.
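
For orientation, the sketch below shows one way image-text pairs could be written to an indexed rec file with MXNet's `recordio` module. It is not a copy of data2rec.py, which remains the reference: the payload layout (`encode_sample`, `decode_sample`) and the `write_rec` helper are assumptions made for illustration.

```python
import struct


def encode_sample(image_bytes: bytes, caption: str) -> bytes:
    """Pack a caption and raw image bytes into one payload (hypothetical layout)."""
    text = caption.encode("utf-8")
    # 4-byte little-endian caption length, then the caption, then the image.
    return struct.pack("<I", len(text)) + text + image_bytes


def decode_sample(payload: bytes) -> tuple[bytes, str]:
    """Invert encode_sample and return (image_bytes, caption)."""
    (text_len,) = struct.unpack("<I", payload[:4])
    caption = payload[4:4 + text_len].decode("utf-8")
    return payload[4 + text_len:], caption


def write_rec(samples, idx_path: str, rec_path: str) -> int:
    """Write (image_bytes, caption) pairs to an indexed rec file; return the count."""
    import mxnet as mx  # imported lazily so the helpers above work without MXNet

    writer = mx.recordio.MXIndexedRecordIO(idx_path, rec_path, "w")
    count = 0
    for index, (image_bytes, caption) in enumerate(samples):
        # The label field is unused here; the caption travels inside the payload.
        header = mx.recordio.IRHeader(flag=0, label=0.0, id=index, id2=0)
        writer.write_idx(index, mx.recordio.pack(header, encode_sample(image_bytes, caption)))
        count += 1
    writer.close()
    return count
```

Storing the whole dataset in one indexed file replaces millions of small file reads with sequential reads, which is the usual reason for choosing this format.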

Pretrained Model Weight

You can download the pretrained weights of RWKV-CLIP-B/32 from Google Drive or BaiduYun.

Training

bash shell/train_RWKV_CLIP_B32_YFCC15M.sh

Evaluation

Evaluate zero-shot cross-modal retrieval

bash shell/test_zero_shot_retrieval.sh

Evaluate zero-shot classification

bash shell/test_zero_shot_classificaiton.sh

Results

Zero-shot cross-modal retrieval

Method	Model	MSCOCO R@1	MSCOCO R@5	MSCOCO R@10	Flickr30k R@1	Flickr30k R@5	Flickr30k R@10
CLIP	B/32	20.8/13.0	43.9/31.7	55.7/42.7	34.9/23.4	63.9/47.2	75.9/58.9
SLIP	B/32	27.7/18.2	52.6/39.2	63.9/51.0	47.8/32.3	76.5/58.7	85.9/68.8
DeCLIP	B/32	28.3/18.4	53.2/39.6	64.5/51.4	51.4/34.3	80.2/60.3	88.9/70.7
UniCLIP	B/32	32.0/20.2	57.7/43.2	69.2/54.4	52.3/34.8	81.6/62.0	89.0/72.0
HiCLIP	B/32	34.2/20.6	60.3/43.8	70.9/55.3	——	——	——
ALIP	B/32	46.8/29.3	72.4/54.4	81.8/65.4	70.5/48.9	91.9/75.1	95.7/82.9
Ours	B/32	50.3/34.0	76.2/60.9	85.2/71.7	76.0/57.6	94.7/82.3	97.6/88.7
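
The R@K columns report Recall@K: the percentage of queries for which the correct match appears among the top K retrieved candidates. The sketch below computes it from a similarity matrix in pure Python. It assumes exactly one correct candidate per query (the one with the same index), which is a simplification, and `recall_at_k` is a hypothetical helper rather than the code in text_image_retrieval.py.

```python
def recall_at_k(similarity: list[list[float]], k: int) -> float:
    """Percentage of queries whose matching candidate is ranked in the top k.

    similarity[i][j] is the score between query i and candidate j; the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank candidate indices by descending score and keep the k best.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix for three queries and three candidates.
scores = [
    [0.9, 0.1, 0.3],  # query 0: correct candidate ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct candidate ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct candidate ranked 3rd
]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"R@{k}: {recall_at_k(scores, k):.1f}")
```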

Zero-shot classification

Method	Model	CIFAR10	CIFAR100	Food101	Pets	Flowers	SUN397	Cars	DTD	Caltech101	Aircraft	ImageNet	Average
CLIP	B/32	63.7	33.2	34.6	20.1	50.1	35.7	2.6	15.5	59.9	1.2	32.8	31.8
SLIP	B/32	50.7	25.5	33.3	23.5	49.0	34.7	2.8	14.4	59.9	1.7	34.3	30.0
FILIP	B/32	65.5	33.5	43.1	24.1	52.7	50.7	3.3	24.3	68.8	3.2	39.5	37.2
DeCLIP	B/32	66.7	38.7	52.5	33.8	60.8	50.3	3.8	27.7	74.7	2.1	43.2	41.3
HiCLIP	B/32	74.1	46.0	51.2	37.8	60.9	50.6	4.5	23.1	67.4	3.6	40.5	41.8
ALIP	B/32	83.8	51.9	45.4	30.7	54.8	47.8	3.4	23.2	74.1	2.7	40.3	41.7
Ours	B/32	79.8	55.1	50.6	37.6	57.1	54.0	4.1	24.6	77.1	4.0	44.3	44.4
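
Zero-shot classification in CLIP-style models is commonly done by embedding one text prompt per class, embedding the image, and predicting the class whose text embedding is most similar to the image embedding. The sketch below illustrates that procedure with toy 2-D vectors in pure Python. It assumes cosine similarity and is not the implementation in zero_shot.py; all function names and values are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def zero_shot_predict(image_embedding: list[float], class_embeddings: dict[str, list[float]]) -> str:
    """Return the class whose text embedding is most similar to the image."""
    return max(class_embeddings, key=lambda name: cosine(image_embedding, class_embeddings[name]))


def top1_accuracy(image_embeddings, labels, class_embeddings) -> float:
    """Top-1 accuracy in percent over a list of image embeddings."""
    correct = sum(
        zero_shot_predict(embedding, class_embeddings) == label
        for embedding, label in zip(image_embeddings, labels)
    )
    return 100.0 * correct / len(labels)


# Toy embeddings; real ones would come from the image and text encoders.
classes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
images = [[0.9, 0.2], [0.1, 0.8], [0.6, 0.5]]
labels = ["cat", "dog", "dog"]

if __name__ == "__main__":
    print(f"top-1 accuracy: {top1_accuracy(images, labels, classes):.1f}")
```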

Acknowledgements

This project is based on RWKV, VisionRWKV, RAM++, LLaMA-Factory, vllm, OFA, and open_clip. We thank the authors for their work.

License

This project is released under the MIT license. Please see the LICENSE file for more information.

📖 Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{gu2024rwkvclip,
      title={RWKV-CLIP: A Robust Vision-Language Representation Learner}, 
      author={Tiancheng Gu and Kaicheng Yang and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
      year={2024},
      eprint={2406.06973},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
