This is the repository for Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey [Paper]
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
🤩 Our paper has been selected for the TPAMI Top 50 Popular Papers list!
Feel free to open a pull request or contact us if you find any related papers that are not included here.
The process to submit a pull request is as follows:
- a. Fork the project into your own repository.
- b. Add the title, paper link, conference, and project/code link to README.md using the following format (a hypothetical example follows this list):
  |[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|
- c. Submit the pull request to this branch.
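For instance, a new entry would look like the line below; the title, links, and venue here are placeholders rather than a real paper:

|[Example Paper Title](https://arxiv.org/abs/0000.00000)|CVPR 2024|[Code](https://github.com/username/example-repo)|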
Last update on 2025/3/24
- [NeurIPS 2024] PLIP: Language-Image Pre-training for Person Representation Learning [Paper][Code]
- [NeurIPS 2024] LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [Paper][Code]
- [NeurIPS 2024] Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight [Paper][Code]
- [NeurIPS 2024] ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [Paper][Code]
- [NeurIPS 2024] Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation [Paper][Code]
- [NeurIPS 2024] Visual Fourier Prompt Tuning [Paper][Code]
- [NeurIPS 2024] Improving Visual Prompt Tuning by Gaussian Neighborhood Minimization for Long-Tailed Visual Recognition [Paper][Code]
- [NeurIPS 2024] Few-Shot Adversarial Prompt Learning on Vision-Language Models [Paper][Code]
- [NeurIPS 2024] Visual Prompt Tuning in Null Space for Continual Learning [Paper][Code]
- [NeurIPS 2024] IPO: Interpretable Prompt Optimization for Vision-Language Models [Paper]
- [NeurIPS 2023] LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning [Paper][Code]
- [NeurIPS 2023] Scaling Open-Vocabulary Object Detection [Paper][Code]
- [NeurIPS 2022] Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [Paper][Code]
- [NeurIPS 2023] CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection [Paper][Code]
- [NeurIPS 2024] Relationship Prompt Learning is Enough for Open-Vocabulary Semantic Segmentation [Paper]
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently: they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, covering: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; and (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
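To make the zero-shot prediction idea concrete, below is a minimal sketch of zero-shot image classification with an open-source CLIP checkpoint via Hugging Face Transformers. The checkpoint name is a real public model, but the image path and label prompts are illustrative placeholders; this is a sketch of the general recipe, not the official code of any surveyed method.

```python
# Minimal zero-shot classification sketch with an open-source CLIP
# checkpoint (assumes `transformers`, `torch`, and `Pillow` are installed).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path to any local image
# Class names are phrased as natural-language prompts; no task-specific
# training is needed, which is the core of zero-shot recognition.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```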
If you find our work useful in your research, please consider citing:
```bibtex
@article{zhang2024vision,
  title={Vision-language models for vision tasks: A survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}
```
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Dataset | Year | Number of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |
Paper | Published in | Code/Project |
---|---|---|
GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
Do Vision and Language Encoders Represent the World Similarly? | CVPR 2024 | Code |
Non-autoregressive Sequence-to-Sequence Vision-Language Models | CVPR 2024 | - |
Paper | Published in | Code/Project |
---|---|---|
Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
Fine-Grained Visual Prompting | arXiv 2023 | - |
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models | ICCV 2023 | Code |
Progressive Visual Prompt Learning with Contrastive Feature Re-formation | IJCV 2024 | Code |
Visual In-Context Prompting | CVPR 2024 | Code |
FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance | ECCV 2024 | Code |
Paper | Published in | Code/Project |
---|---|---|
UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |
Learning to Prompt Segment Anything Models | arXiv 2024 | - |
An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models | ICLR 2024 | - |
GalLoP: Learning Global and Local Prompts for Vision-Language Models | ECCV 2024 | - |
CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts | ECCV 2024 | Code |