(back to README.md and README_2.md for other categories)
If you find this repository useful, please consider citing this list:

    @misc{chen2022transformerpaperlist,
        title = {Ultimate awesome paper list: transformer and attention},
        author = {Chen, Min-Hung},
        journal = {GitHub repository},
        url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
        year = {2022},
    }
### Visual Captioning
- General:
- SAT: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015. [Paper]
- ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
- M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
- MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
- SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
- DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
- CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
- ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
- LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
- LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
- GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
- GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
- PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
- VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
- ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
- CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- ?: "Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning", CVPR, 2022 (Georgia Tech). [Paper][PyTorch]
- CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
- ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
- SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
- RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
- GRIT: "GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
- ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
- UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
- TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
- DML: "Learning Distinct and Representative Modes for Image Captioning", NeurIPS, 2022 (University of Adelaide, Australia). [Paper]
- P2C: "Paraphrasing Is All You Need for Novel Object Captioning", NeurIPS, 2022 (NTU + CMU). [Paper]
- BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", NeurIPS, 2022 (Microsoft). [Paper]
- CapDec: "Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP, 2022 (Tel Aviv). [Paper][PyTorch]
- ?: "Focus! Relevant and Sufficient Context Selection for News Image Captioning", EMNLP Findings, 2022 (UC Davis). [Paper]
- CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
- ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
- VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
- SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
- ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]
- CLM: "Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment", arXiv, 2022 (CAS). [Paper]
- PromptCap: "PromptCap: Prompt-Guided Task-Aware Image Captioning", arXiv, 2022 (UW). [Paper]
- PTSN: "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
- DDCap: "Exploring Discrete Diffusion Models for Image Captioning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- ARIC: "Aesthetically Relevant Image Captioning", AAAI, 2023 (Shenzhen University). [Paper][Code (in construction)]
- UAIC: "Uncertainty-Aware Image Captioning", AAAI, 2023 (Meituan). [Paper]
- LiMBeR: "Linearly Mapping from Image to Text Space", ICLR, 2023 (Brown University). [Paper]
- DiscriTune: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
- LIBRA: "Model-Agnostic Gender Debiased Image Captioning", CVPR, 2023 (Osaka University). [Paper]
- A-CAP: "A-CAP: Anticipation Captioning with Commonsense Knowledge", CVPR, 2023 (The University of Tokyo). [Paper]
- HAAV: "HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning", CVPR, 2023 (Georgia Tech). [Paper][Website]
- ?: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
- PAC-S: "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation", CVPR, 2023 (UniMoRE, Italy). [Paper][PyTorch]
- SCD-Net: "Semantic-Conditional Diffusion Networks for Image Captioning", CVPR, 2023 (JD). [Paper][PyTorch]
- ConZIC: "ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing", CVPR, 2023 (Xidian University). [Paper][PyTorch]
- SmallCap: "SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation", CVPR, 2023 (University of Lisbon, Portugal). [Paper][PyTorch]
- LSML: "Crossing the Gap: Domain Generalization for Image Captioning", CVPR, 2023 (USTC). [Paper]
- MuE: "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model", CVPR, 2023 (NC State). [Paper]
- OxfordTVG-HIC: "OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?", ICCV, 2023 (Oxford). [Paper][Website]
- ?: "Guiding Image Captioning Models Toward More Specific Captions", ICCV, 2023 (Google). [Paper]
- ViECap: "Transferable Decoding with Visual Entities for Zero-Shot Image Captioning", ICCV, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
- PMA-Net: "With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning", ICCV, 2023 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper][Code (in construction)]
- SCORER: "Self-supervised Cross-view Representation Reconstruction for Change Captioning", ICCV, 2023 (CAS). [Paper][Code (in construction)]
- TSG: "Transforming Visual Scene Graphs to Image Captions", ACL, 2023 (Southeast University, China). [Paper][PyTorch]
- InfoMetIC: "InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
- MultiCapCLIP: "MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning", ACL, 2023 (Peking). [Paper][PyTorch (in construction)]
- Cur-VL: "Learning from Children: Improving Image-Caption Pretraining via Curriculum", ACL Findings, 2023 (Columbia). [Paper][Code (in construction)]
- ?: "Text-Only Training for Visual Storytelling", ACMMM, 2023 (USTC). [Paper]
- CgT-GAN: "CgT-GAN: CLIP-guided Text GAN for Image Captioning", ACMMM, 2023 (USTC). [Paper][PyTorch]
- Re-ViLM: "Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning", arXiv, 2023 (NVIDIA). [Paper]
- Knight: "From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping", arXiv, 2023 (Alibaba). [Paper][PyTorch]
- VTT: "Visual Transformation Telling", arXiv, 2023 (CAS). [Paper]
- Caption-Anything: "Caption Anything: Interactive Image Description with Diverse Multimodal Controls", arXiv, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
- COLA: "COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?", arXiv, 2023 (Boston). [Paper]
- ?: "Data Curation for Image Captioning with Text-to-Image Generative Models", arXiv, 2023 (University of Copenhagen, Denmark). [Paper]
- TLC: "Simple Token-Level Confidence Improves Caption Correctness", arXiv, 2023 (Meta). [Paper]
- VIVID: "Album Storytelling with Iterative Story-aware Captioning and Large Language Models", arXiv, 2023 (Peking). [Paper]
- MCDG: "Text-Only Image Captioning with Multi-Context Data Generation", arXiv, 2023 (USTC). [Paper]
- FuseCap: "FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions", arXiv, 2023 (Israel Institute of Technology). [Paper]
- StoryGen: "Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)][Website]
- ?: "Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion", arXiv, 2023 (University of Milano-Bicocca, Italy). [Paper]
- SITTA: "SITTA: A Semantic Image-Text Alignment for Image Captioning", arXiv, 2023 (Johannes Kepler University, Austria). [Paper][PyTorch]
- MMNS: "Multimodal Neurons in Pretrained Text-Only Transformers", arXiv, 2023 (MIT). [Paper]
- RegionBLIP: "RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension", arXiv, 2023 (Alibaba). [Paper][PyTorch]
- ?: "Visually-Aware Context Modeling for News Image Captioning", arXiv, 2023 (KU Leuven). [Paper]
- Video:
- Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
- BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
- ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
- PDVC: "End-to-End Dense Video Captioning with Parallel Decoding", ICCV, 2021 (HKU + Southern University of Science and Technology). [Paper][PyTorch]
- MV-GPT: "End-to-end Generative Pretraining for Multimodal Video Captioning", CVPR, 2022 (Google). [Paper]
- VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
- UVC-VI: "Aligning Source Visual and Target Language Domains for Unpaired Video Captioning", TPAMI, 2022 (Peking University). [Paper]
- D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
- VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
- VCRN: "Visual Commonsense-aware Representation Network for Video Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
- RSFD: "Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning", arXiv, 2022 (Wuhan University of Technology). [Paper][Code (in construction)]
- VLTinT: "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning", AAAI, 2023 (University of Arkansas). [Paper]
- Vid2Seq: "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning", CVPR, 2023 (Google). [Paper][Website]
- TextKG: "Text with Knowledge Graph Augmented Transformer for Video Captioning", CVPR, 2023 (ByteDance). [Paper]
- G2L: "G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory", ICCV, 2023 (Peking). [Paper]
- CoCap: "Accurate and Fast Compressed Video Captioning", ICCV, 2023 (CAS). [Paper][PyTorch]
- Movie101: "Movie101: A New Movie Understanding Benchmark", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
- VidChapters-7M: "VidChapters-7M: Video Chapters at Scale", NeurIPS (Datasets and Benchmarks), 2023 (INRIA). [Paper][PyTorch][Website]
- ?: "Implicit and Explicit Commonsense for Multi-sentence Video Captioning", arXiv, 2023 (UBC). [Paper]
- Video-Verbalization: "A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot", arXiv, 2023 (Adobe). [Paper]
- Dense-VOC: "Dense Video Object Captioning from Disjoint Supervision", arXiv, 2023 (Google). [Paper]
- ?: "Exploring the Role of Audio in Video Captioning", arXiv, 2023 (ByteDance). [Paper]
- ZeroTA: "Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment", arXiv, 2023 (KAIST). [Paper]
- Video-CSR: "Video-CSR: Complex Video Digest Creation for Visual-Language Models", arXiv, 2023 (ByteDance). [Paper]
- 3D:
- Vote2Cap-DETR: "End-to-End 3D Dense Captioning with Vote2Cap-DETR", CVPR, 2023 (Fudan). [Paper][PyTorch]
- Cap3D: "Scalable 3D Captioning with Pretrained Models", arXiv, 2023 (UMich). [Paper][Dataset]
- Vote2Cap-DETR++: "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning", arXiv, 2023 (Fudan). [Paper][PyTorch]
- Others:
- ET-Cap: "Explore and Tell: Embodied Visual Captioning in 3D Environments", ICCV, 2023 (Renmin University of China). [Paper][Code (in construction)][Website]
### Visual Question Answering (VQA)
- General:
- MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
- M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
- SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
- ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
- TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
- UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
- TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
- ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
- VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
- Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
- RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper][PyTorch]
- Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
- X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
- UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
- LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
- QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
- WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
- ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
- cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
- Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
- ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
- MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
- ?: "Training Vision-Language Models with Less Bimodal Supervision", Automated Knowledge Base Construction (AKBC), 2022 (Tel Aviv). [Paper]
- REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", NeurIPS, 2022 (Microsoft). [Paper]
- ScienceQA: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering", NeurIPS, 2022 (AI2). [Paper][PyTorch][Website]
- FrozenBiLM: "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models", NeurIPS, 2022 (INRIA). [Paper][PyTorch]
- MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
- MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
- EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
- CRIPP-VQA: "CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering", EMNLP, 2022 (Arizona State University). [Paper][PyTorch][Website]
- PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
- TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
- ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
- DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
- PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
- TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
- UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
- CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
- mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
- CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
- ?: "Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering", arXiv, 2022 (CAS). [Paper]
- VLR: "Visually Grounded VQA by Lattice-based Retrieval", arXiv, 2022 (University of Bremen, Germany). [Paper]
- CMCL: "Cross-Modal Contrastive Learning for Robust Reasoning in VQA", arXiv, 2022 (University of Sydney). [Paper][PyTorch]
- CL-CrossVQA: "CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering", arXiv, 2022 (LMU Munich). [Paper]
- OFA-X: "Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations", arXiv, 2022 (University of Hamburg, Germany). [Paper][Code (in construction)]
- VLC-BERT: "VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge", WACV, 2023 (UBC, Canada). [Paper][PyTorch]
- LTG: "Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA", AAAI, 2023 (USTC). [Paper]
- SelTDA: "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!", CVPR, 2023 (NEC). [Paper][PyTorch]
- Prophet: "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
- GenB: "Generative Bias for Robust Visual Question Answering", CVPR, 2023 (KAIST). [Paper]
- MixPHM: "MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering", CVPR, 2023 (Xi'an Jiaotong University). [Paper]
- POEM: "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", CVPR, 2023 (University of Minnesota (UMN)). [Paper][PyTorch]
- LYP: "Improving Selective Visual Question Answering by Learning From Your Peers", CVPR, 2023 (Meta). [Paper]
- VQACL: "VQACL: A Novel Visual Question Answering Continual Learning Setting", CVPR, 2023 (CAS). [Paper][PyTorch]
- Img2LLM: "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", CVPR, 2023 (Salesforce). [Paper][PyTorch]
- Imp-VQA: "Logical Implications for Visual Question Answering Consistency", CVPR, 2023 (University of Bern, Switzerland). [Paper][PyTorch][Website]
- RMLVQA: "RMLVQA: A Margin Loss Approach For Visual Question Answering with Language Biases", CVPR, 2023 (Indian Institute of Science). [Paper][PyTorch]
- S3C: "S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
- ?: "Diversifying Joint Vision-Language Tokenization Learning", CVPRW, 2023 (DeepMind). [Paper]
- VQAAnswerTherapy: "VQA Therapy: Exploring Answer Differences by Visually Grounding Answers", ICCV, 2023 (UT Austin). [Paper][Website]
- ViTiS: "Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts", ICCVW, 2023 (INRIA). [Paper][Website]
- TwO: "Combo of Thinking and Observing for Outside-Knowledge VQA", ACL, 2023 (ByteDance). [Paper][Code (in construction)]
- Mod-Zero-VQA: "Modularized Zero-shot VQA with Pre-trained Models", ACL Findings, 2023 (Singapore Management University). [Paper]
- SaL: "Separate and Locate: Rethink the Text in Text-based Visual Question Answering", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
- InfoSeek: "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?", arXiv, 2023 (Google). [Paper][Website]
- CoVGT: "Contrastive Video Question Answering via Video Graph Transformer", arXiv, 2023 (NUS). [Paper]
- RVQA: "Toward Unsupervised Realistic Visual Question Answering", arXiv, 2023 (UCSD). [Paper]
- WHOOPS: "Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images", arXiv, 2023 (Ben Gurion University of the Negev, Israel). [Paper][Website]
- IVLT: "Causality-aware Visual Scene Discovery for Cross-Modal Question Reasoning", arXiv, 2023 (Sun Yat-sen University). [Paper]
- MGT: "Multimodal Graph Transformer for Multimodal Question Answering", arXiv, 2023 (UC Santa Cruz). [Paper]
- VCSR: "Visual Causal Scene Refinement for Video Question Answering", arXiv, 2023 (Sun Yat-sen University). [Paper]
- SeeTRUE: "What You See is What You Read? Improving Text-Image Alignment Evaluation", arXiv, 2023 (Google). [Paper][PyTorch][Website]
- JADE: "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner", arXiv, 2023 (CAS). [Paper]
- NuScenes-QA: "NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario", arXiv, 2023 (Fudan). [Paper][Code (in construction)]
- LAMOC: "Zero-shot Visual Question Answering with Language Model Feedback", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
- PW-VQA: "Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA", arXiv, 2023 (University of Rochester). [Paper]
- Encyclopedic-VQA: "Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories", arXiv, 2023 (Google). [Paper]
- ?: "Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering", arXiv, 2023 (Mila). [Paper]
- R2A: "Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models", arXiv, 2023 (CUHK). [Paper]
- WikiTiLo: "Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning", arXiv, 2023 (LMU Munich). [Paper]
- GenVQA: "Generative Visual Question Answering", arXiv, 2023 (UW). [Paper]
- Context-VQA: "Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering", arXiv, 2023 (Stanford). [Paper]
- BLIVA: "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions", arXiv, 2023 (UCSD). [Paper]
- NExT-GQA: "Can I Trust Your Answer? Visually Grounded Video Question Answering", arXiv, 2023 (NUS). [Paper]
- CURE: "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", arXiv, 2023 (SRI). [Paper][Code (in construction)]
- RepARe: "Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models", arXiv, 2023 (UNC). [Paper][PyTorch]
- Video:
- ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
- TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
- SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (UMich). [Paper][Website]
- VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
- ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
- DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
- ViteVQA: "Towards Video Text Visual Question Answering: Benchmark and Baseline", NeurIPS, 2022 (ByteDance). [Paper][GitHub]
- WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
- LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]
- NewsVideoQA: "Watching the News: Towards VideoQA Models that can Read", arXiv, 2022 (IIIT Hyderabad, India). [Paper]
- SHG-VQA: "Learning Situation Hyper-Graphs for Video Question Answering", CVPR, 2023 (UCF). [Paper][PyTorch]
- ANetQA: "ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos", CVPR, 2023 (Hangzhou Dianzi University). [Paper][Website]
- MCR: "Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering", CVPR, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
- MIST: "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", CVPR, 2023 (NUS). [Paper][PyTorch]
- CaKE-LM: "Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering", CVPRW, 2023 (NTU + Columbia). [Paper]
- TransSTR: "Discovering Spatio-Temporal Rationales for Video Question Answering", ICCV, 2023 (NUS). [Paper]
- Tem-adapter: "Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer", ICCV, 2023 (CMU). [Paper][Code (in construction)]
- OVQA: "Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models", ICCV, 2023 (Korea University). [Paper]
- RaFormer: "Redundancy-aware Transformer for Video Question Answering", ACMMM, 2023 (NUS). [Paper]
- SeViLA: "Self-Chained Image-Language Model for Video Localization and Question Answering", arXiv, 2023 (UNC). [Paper][PyTorch]
- FunQA: "FunQA: Towards Surprising Video Comprehension", arXiv, 2023 (Beijing University of Posts and Telecommunication). [Paper][Code (in construction)][Website]
- 3D:
- 3D-VQA: "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes", CVPRW, 2023 (ETHZ). [Paper][Code (in construction)]
- Multi-CLIP: "Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes", arXiv, 2023 (ETHZ). [Paper]
- Audio-Visual:
### Visual Grounding
- General:
- TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
- ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
- MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
- TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
- GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
- Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
- VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
- UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
- Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
- CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
- MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
- GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
- QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
- SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
- UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
- TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
- YORO: "YORO - Lightweight End to End Visual Grounding", ECCVW, 2022 (Amazon). [Paper]
- GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
- ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
- SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
- TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
- HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
- Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
- ClipCrop: "ClipCrop: Conditioned Cropping Driven by Vision-Language Model", arXiv, 2022 (The University of Tokyo). [Paper]
- VL-MPAG-Net: "Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing", WACV, 2023 (Indian Institute of Science). [Paper][PyTorch][Website]
- CLEVER: "Visually Grounded Commonsense Knowledge Acquisition", AAAI, 2023 (Tsinghua University). [Paper][PyTorch]
- LADS: "Referring Expression Comprehension Using Language Adaptive Inference", AAAI, 2023 (Zhejiang University). [Paper]
- ?: "Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models", ICLR, 2023 (Samsung). [Paper]
- AMC: "Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
- CounTEX: "Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space", CVPR, 2023 (Amazon). [Paper]
- SK-VG: "Advancing Visual Grounding with Scene Knowledge: Benchmark and Method", CVPR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
- D-ViTMDETR: "Dynamic Inference with Grounding Based Vision and Language Models", CVPR, 2023 (Amazon). [Paper]
- ?: "Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding", CVPR, 2023 (Tel Aviv). [Paper][Code (in construction)]
- RefCLIP: "RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension", CVPR, 2023 (Xiamen University). [Paper][PyTorch][Website]
- FROMAGe: "Grounding Language Models to Images for Multimodal Inputs and Outputs", ICML, 2023 (CMU). [Paper][PyTorch][Website]
- IR-VG: "Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision", ICCV, 2023 (Beihang). [Paper][Code (in construction)]
- RefEgo: "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D", ICCV, 2023 (RIKEN). [Paper]
- CLIP-VG: "CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding", arXiv, 2023 (CAS). [Paper][Code (in construction)]
- TreePrompt: "TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding", arXiv, 2023 (HKUST). [Paper]
- OctoBERT: "World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models", arXiv, 2023 (UMich). [Paper]
- BuboGPT: "BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
- LG-DVG: "Language-Guided Diffusion Model for Visual Grounding", arXiv, 2023 (University of Toronto). [Paper]
- VGDiffZero: "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders", arXiv, 2023 (Westlake University, China). [Paper]
- GREC: "GREC: Generalized Referring Expression Comprehension", arXiv, 2023 (NTU, Singapore). [Paper][Website]
- Video:
- Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
- GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
- STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
- DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
- TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
- UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
- STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
- STCAT: "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
- VideoWhisperer: "Grounded Video Situation Recognition", NeurIPS, 2022 (IIIT Hyderabad, India). [Paper][Website]
- VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
- ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]
- VG-LAW: "Language Adaptive Weight Generation for Multi-task Visual Grounding", CVPR, 2023 (Zhejiang University). [Paper]
- TCSF: "You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
- ?: "Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training", CVPR, 2023 (The University of Tokyo). [Paper]
- DeCo: "DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking", CVPR, 2023 (Toyota). [Paper]
- HSCNet: "Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
- WINNER: "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding", CVPR, 2023 (Zhejiang University). [Paper]
- IRON: "Iterative Proposal Refinement for Weakly-Supervised Video Grounding", CVPR, 2023 (Microsoft). [Paper]
- ?: "Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
- ProTeGe: "ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding", CVPR, 2023 (Microsoft). [Paper]
- VidLN: "Connecting Vision and Language with Video Localized Narratives", CVPR, 2023 (Google). [Paper][Website]
- VDI: "Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training", CVPR, 2023 (Queen Mary University of London). [Paper]
- UniVTG: "UniVTG: Towards Unified Video-Language Temporal Grounding", ICCV, 2023 (NUS). [Paper][PyTorch]
- EaTR: "Knowing Where to Focus: Event-aware Transformer for Video Grounding", ICCV, 2023 (Yonsei). [Paper][PyTorch]
- TSGSV: "Temporal Sentence Grounding in Streaming Videos", ACMMM, 2023 (Shandong University). [Paper]
- ?: "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos", arXiv, 2023 (Southern University of Science and Technology, China). [Paper]
- MomentDiff: "MomentDiff: Generative Video Moment Retrieval from Random to Real", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
- BM-DETR: "Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval", arXiv, 2023 (Seoul National University (SNU)). [Paper][PyTorch (in construction)]
- ?: "Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models", WACV, 2024 (Queen Mary University of London). [Paper]
- 3D:
- ViL3DRel: "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding", NeurIPS, 2022 (INRIA). [Paper][Website]
- LAR: "Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding", NeurIPS, 2022 (KAUST). [Paper][Website]
- 3D-CG: "3D Concept Grounding on Neural Fields", NeurIPS, 2022 (MIT). [Paper][Website]
- NS3D: "NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations", CVPR, 2023 (Stanford). [Paper]
- EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding", CVPR, 2023 (Peking University). [Paper]
- ?: "Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding", ICCV, 2023 (Zhejiang University). [Paper]
- Multi3DRefer: "Multi3DRefer: Grounding Text Description to Multiple 3D Objects", ICCV, 2023 (Simon Fraser). [Paper]
- UniT3D: "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding", ICCV, 2023 (TUM). [Paper]
- 3DOGSFormer: "Dense Object Grounding in 3D Scenes", ACMMM, 2023 (Peking). [Paper]
- ViewRefer: "ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- ?: "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions", arXiv, 2023 (Columbia). [Paper]
- 3DRP-Net: "3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding", arXiv, 2023 (Zhejiang University). [Paper]
- 3DRefTR: "A Unified Framework for 3D Point Cloud Visual Grounding", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
- CoT3DRef: "CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding", arXiv, 2023 (KAUST). [Paper]
### Multi-Modal Representation Learning
- General:
- LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
- ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
- Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
- UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
- VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
- CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
- ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
- MERLOT: "MERLOT: Multimodal Neural Script Knowledge Models", NeurIPS, 2021 (UW + AI2). [Paper][Tensorflow][Website]
- SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
- CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
- Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
- UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
- SimVLM: "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision", ICLR, 2022 (Google). [Paper]
- LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
- UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
- LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
- METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- Uni-Perceiver: "Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks", CVPR, 2022 (SenseTime). [Paper][PyTorch]
- MERLOT-Reserve: "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", CVPR, 2022 (UW + AI2). [Paper][JAX][Website]
- Omnivore: "Omnivore: A Single Model for Many Visual Modalities", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
- CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
- VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
- VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
- X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
- BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
- OFA: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML, 2022 (Alibaba). [Paper][PyTorch]
- MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
- SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
- Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
- OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
- UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
- Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", NeurIPS, 2022 (SenseTime). [Paper][PyTorch]
- CLOOB: "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP", NeurIPS, 2022 (Johannes Kepler University, Austria). [Paper][PyTorch]
- CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", NeurIPS, 2022 (UCLA). [Paper]
- ?: "Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP", NeurIPS, 2022 (UW). [Paper][Pytorch]
- PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", NeurIPS, 2022 (Tencent). [Paper]
- ?: "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning", NeurIPS, 2022 (Stanford). [Paper][Website]
- LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", NeurIPS, 2022 (Google). [Paper]
- VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS, 2022 (Microsoft). [Paper][PyTorch (in construction)]
- Knowledge-CLIP: "Contrastive Language-Image Pre-Training with Knowledge Graphs", NeurIPS, 2022 (Tsinghua). [Paper]
- Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", NeurIPS, 2022 (DeepMind). [Paper]
- LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", NeurIPS, 2022 (Huawei). [Paper][Code (in construction)]
- FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
- UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", NeurIPS, 2022 (Google). [Paper]
- LAION-5B: "LAION-5B: An open large-scale dataset for training next generation image-text models", NeurIPS (Datasets and Benchmarks), 2022 (LAION). [Paper][Website]
- Wukong: "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark", NeurIPS (Datasets and Benchmarks), 2022 (Huawei). [Paper][Website]
- TaiSu: "TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training", NeurIPS (Datasets and Benchmarks), 2022 (CAS). [Paper][PyTorch]
- WinoGAViL: "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models", NeurIPS (Datasets and Benchmarks), 2022 (The Hebrew University of Jerusalem, Israel). [Paper][Website]
- ELEVATER: "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models", NeurIPS (Datasets and Benchmarks), 2022 (Microsoft). [Paper][Website]
- ?: "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations", NeurIPS (Datasets and Benchmarks), 2022 (UCF). [Paper][Website]
- GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", TMLR, 2022 (Microsoft). [Paper]
- CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", TMLR, 2022 (Google). [Paper][PyTorch (lucidrains)]
- MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
- VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
- CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
- VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
- MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
- e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
- LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
- UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
- Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
- VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
- ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
- DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
- ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
- ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
- VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
- ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
- MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
- EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
- CN-CLIP: "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese", arXiv, 2022 (Alibaba). [Paper]
- CLOSE: "I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data", arXiv, 2022 (AI2). [Paper]
- X2-VLM: "X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks", arXiv, 2022 (ByteDance). [Paper][Code (in construction)]
- SkillNet: "One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code", arXiv, 2022 (Tencent). [Paper]
- Compound-Tokens: "Compound Tokens: Channel Fusion for Vision-Language Representation Learning", arXiv, 2022 (Google). [Paper]
- WFH: "Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision", WACV, 2023 (Aalto University, Finland). [Paper]
- Perceiver-VL: "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention", WACV, 2023 (UNC). [Paper][PyTorch]
- MixGen: "MixGen: A New Multi-Modal Data Augmentation", WACVW, 2023 (Amazon). [Paper]
- ?: "Unifying Vision-Language Representation Space with Single-tower Transformer", AAAI, 2023 (NAVER). [Paper]
- PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ICLR, 2023 (Google). [Paper]
- LilT: "Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning", ICLR, 2023 (Northeastern University). [Paper][PyTorch]
- CLIPs: "Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", ICLR, 2023 (Stanford). [Paper]
- HiCLIP: "HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention", ICLR, 2023 (Rutgers University). [Paper]
- DeCap: "DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training", ICLR, 2023 (Zhejiang University). [Paper][PyTorch]
- MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", ICLR, 2023 (Amazon). [Paper]
- DaVinci: "Write and Paint: Generative Vision-Language Models are Unified Modal Learners", ICLR, 2023 (ByteDance). [Paper][Code (in construction)]
- EVA: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", CVPR, 2023 (Beijing Academy of Artificial Intelligence (BAAI)). [Paper][PyTorch]
- FLM: "Accelerating Vision-Language Pretraining with Free Language Modeling", CVPR, 2023 (Tencent). [Paper][PyTorch]
- FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
- VILA: "VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining", CVPR, 2023 (Google). [Paper][JAX]
- BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- ReVeaL: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", CVPR, 2023 (Google). [Paper][Website]
- SCL: "Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning", CVPR, 2023 (Tencent). [Paper]
- EPIC: "Leveraging per Image-Token Consistency for Vision-Language Pre-training", CVPR, 2023 (ByteDance). [Paper]
- PTP: "Position-guided Text Prompt for Vision-Language Pre-training", CVPR, 2023 (Sea AI Lab). [Paper][PyTorch]
- PHASE: "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias", CVPR, 2023 (Osaka University). [Paper][GitHub]
- Uni-Perceiver-v2: "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- ?: "Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language", CVPR, 2023 (Beijing Institute of Technology). [Paper]
- GIVL: "GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods", CVPR, 2023 (Amazon). [Paper]
- FLIP: "Scaling Language-Image Pre-training via Masking", CVPR, 2023 (Meta). [Paper][PyTorch]
- MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", CVPR, 2023 (Tsinghua + Waseda). [Paper][PyTorch]
- DANCE: "Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles", CVPR, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
- xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", CVPR, 2023 (Microsoft). [Paper]
- SVLC: "Teaching Structured Vision & Language Concepts to Vision&Language Models", CVPR, 2023 (IBM). [Paper]
- DeAR: "DeAR: Debiasing Vision-Language Models with Additive Residuals", CVPR, 2023 (Adobe). [Paper][GitHub]
- ?: "Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning", CVPR, 2023 (Amazon). [Paper]
- ?: "Joint Adaptive Representations for Image-Language Learning", CVPRW, 2023 (DeepMind). [Paper]
- BLIP-2: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", ICML, 2023 (Salesforce). [Paper][PyTorch]
- RLEG: "RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation", ICML, 2023 (Alibaba). [Paper]
- Mod-X: "Continual Vision-Language Representation Learning with Off-Diagonal Information", ICML, 2023 (Huawei). [Paper]
- ILLUME: "ILLUME: Rationalizing Vision-Language Models through Human Interactions", ICML, 2023 (German Research Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
- Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", ICML, 2023 (Google). [Paper]
- MERU: "Hyperbolic Image-Text Representations", ICML, 2023 (Meta). [Paper]
- ?: "Measuring Progress in Fine-grained Vision-and-Language Understanding", ACL, 2023 (DeepMind). [Paper]
- RELIT: "Weakly Supervised Vision-and-Language Pre-training with Relative Representations", ACL, 2023 (Tsinghua). [Paper]
- PuMer: "PuMer: Pruning and Merging Tokens for Efficient Vision Language Models", ACL, 2023 (UW). [Paper]
- SINC: "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks", ICCV, 2023 (Microsoft). [Paper]
- ALIP: "ALIP: Adaptive Language-Image Pre-training with Synthetic Caption", ICCV, 2023 (DeepGlint, China). [Paper][PyTorch]
- SigLiT: "Sigmoid Loss for Language Image Pre-Training", ICCV, 2023 (Google). [Paper]
- VL-PET: "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control", ICCV, 2023 (CUHK). [Paper][PyTorch]
- GrowCLIP: "GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper]
- ViLLA: "ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data", ICCV, 2023 (Stanford). [Paper][PyTorch]
- CFM-ViT: "Contrastive Feature Masking Open-Vocabulary Vision Transformer", ICCV, 2023 (DeepMind). [Paper]
- OPTIMA: "Module-wise Adaptive Distillation for Multimodality Foundation Models", NeurIPS, 2023 (Google). [Paper]
- KOSMOS-1: "Language Is Not All You Need: Aligning Perception with Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
- Prismer: "Prismer: A Vision-Language Model with An Ensemble of Experts", arXiv, 2023 (NVIDIA). [Paper][PyTorch][Website]
- RVLM: "Replacement as a Self-supervision for Fine-grained Vision-language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
- MuLTI: "MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling", arXiv, 2023 (Alibaba). [Paper]
- VL-MoE: "Scaling Vision-Language Models with Sparse Mixture of Experts", arXiv, 2023 (Berkeley + Microsoft). [Paper]
- EVA-02: "EVA-02: A Visual Representation for Neon Genesis", arXiv, 2023 (BAAI). [Paper][PyTorch]
- CoBIT: "CoBIT: A Contrastive Bi-directional Image-Text Generation Model", arXiv, 2023 (Google). [Paper]
- EqSim: "Equivariant Similarity for Vision-Language Foundation Models", arXiv, 2023 (Microsoft). [Paper][PyTorch]
- EVA-CLIP: "EVA-CLIP: Improved Training Techniques for CLIP at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
- MaMMUT: "MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks", arXiv, 2023 (Google). [Paper]
- CAVL: "CAVL: Learning Contrastive and Adaptive Representations of Vision and Language", arXiv, 2023 (CMU). [Paper]
- MoMo: "MoMo: A shared encoder Model for text, image and multi-Modal representations", arXiv, 2023 (Amazon). [Paper]
- REAVL: "Retrieval-based Knowledge Augmented Vision Language Pre-training", arXiv, 2023 (Tencent). [Paper]
- ALBEF-MI: "Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation", arXiv, 2023 (Alibaba). [Paper]
- Helip: "Boosting Visual-Language Models by Exploiting Hard Samples", arXiv, 2023 (Huawei). [Paper]
- IMP: "Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception", arXiv, 2023 (Google). [Paper]
- Musketeer: "Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts", arXiv, 2023 (Amazon). [Paper]
- GVT: "What Makes for Good Visual Tokenizers for Large Language Models?", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- S-CLIP: "S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions", arXiv, 2023 (KAIST). [Paper]
- VisorGPT: "VisorGPT: Learning Visual Prior via Generative Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
- IdealGPT: "IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models", arXiv, 2023 (Columbia University). [Paper][PyTorch]
- PaLI-X: "PaLI-X: On Scaling up a Multilingual Vision and Language Model", arXiv, 2023 (Google). [Paper]
- CrossGET: "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
- TL;DR: "Too Large; Data Reduction for Vision-Language Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)]
- DiffusionITM: "Are Diffusion Models Vision-And-Language Reasoners?", arXiv, 2023 (Mila). [Paper]
- COSA: "COSA: Concatenated Sample Pretrained Vision-Language Foundation Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
- Babel-ImageNet: "Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
- Kosmos-2: "Kosmos-2: Grounding Multimodal Large Language Models to the World", arXiv, 2023 (Microsoft). [Paper][PyTorch][Demo]
- LENS: "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language", arXiv, 2023 (Contextual AI + Stanford). [Paper][PyTorch][Demo]
- OBELISC: "OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents", arXiv, 2023 (Hugging Face). [Paper][GitHub]
- Emu: "Generative Pretraining in Multimodality", arXiv, 2023 (BAAI). [Paper][PyTorch]
- mBLIP: "mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
- P-Former: "Bootstrapping Vision-Language Learning with Decoupled Language Pre-training", arXiv, 2023 (Dartmouth College). [Paper]
- SEED-OPT: "Planting a SEED of Vision in Large Language Model", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- OpenFlamingo: "OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models", arXiv, 2023 (UW). [Paper][PyTorch]
- Free-ATM: "Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks", arXiv, 2023 (ByteDance). [Paper]
- LCL: "Link-Context Learning for Multimodal LLMs", arXiv, 2023 (SenseTime). [Paper]
- DLIP: "DLIP: Distilling Language-Image Pre-training", arXiv, 2023 (ByteDance). [Paper]
- ViLTA: "ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation", arXiv, 2023 (Tsinghua). [Paper]
- DAS: "Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
- LaVIT: "Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization", arXiv, 2023 (Kuaishou). [Paper][Code (in construction)]
- MMICL: "MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning", arXiv, 2023 (Peking). [Paper][PyTorch]
- ELIP: "ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens", arXiv, 2023 (NUS). [Paper]
- SEED-LLaMA: "Making LLaMA SEE and Draw with SEED Tokenizer", arXiv, 2023 (Tencent). [Paper][PyTorch]
- ITIT: "Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency", arXiv, 2023 (Google). [Paper]
- SimVLG: "SimVLG: Simple and Efficient Pretraining of Visual Language Generative Models", arXiv, 2023 (ByteDance). [Paper]
- VeCLIP: "From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions", arXiv, 2023 (Apple). [Paper]
- PaLI-3: "PaLI-3 Vision Language Models: Smaller, Faster, Stronger", arXiv, 2023 (Google). [Paper]
- COMM: "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models", arXiv, 2023 (Huawei). [Paper][PyTorch (in construction)]
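The sigmoid loss cited in the SigLiT entry above replaces the usual softmax contrastive objective with independent binary terms over every image-text pair. Below is a minimal, hedged PyTorch sketch of that loss; the shapes, the learnable temperature `t` and bias `b`, and their initial values are assumptions for illustration, not the official implementation.

```python
# Minimal sketch of the pairwise sigmoid loss from "Sigmoid Loss for Language
# Image Pre-Training". Shapes and the temperature/bias values are assumptions.
import torch
import torch.nn.functional as F

def sigmoid_pairwise_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings; t, b: scalar tensors."""
    logits = img_emb @ txt_emb.t() * t + b                # (N, N) pairwise logits
    n = logits.size(0)
    labels = 2 * torch.eye(n, device=logits.device) - 1   # +1 on matching pairs, -1 otherwise
    # -log sigmoid(labels * logits) == softplus(-labels * logits)
    return F.softplus(-labels * logits).sum() / n         # the paper normalizes by batch size

# toy usage
n, d = 8, 512
img = F.normalize(torch.randn(n, d), dim=-1)
txt = F.normalize(torch.randn(n, d), dim=-1)
t = torch.tensor(10.0)    # the paper initializes the log-temperature to log(10)
b = torch.tensor(-10.0)   # and the bias to -10; both are learnable in training
loss = sigmoid_pairwise_loss(img, txt, t, b)
```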
- Video:
- COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
- Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
- ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
- VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper][PyTorch]
- VideoCLIP: "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP, 2021 (Facebook). [Paper][PyTorch]
- VALUE: "VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation", NeurIPS (Datasets and Benchmarks), 2021 (Microsoft). [Paper][Website]
- TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
- HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
- ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
- ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
- CLOP: "CLOP: Video-and-Language Pre-Training with Knowledge Regularizations", ACMMM, 2022 (Baidu). [Paper]
- LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
- FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
- EMCL: "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
- LF-VILA: "Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning", NeurIPS, 2022 (Microsoft). [Paper][GitHub]
- VATT-GR-CL: "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization", NeurIPS, 2022 (Google). [Paper]
- LGDN: "LGDN: Language-Guided Denoising Network for Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
- EgoVLP: "Egocentric Video-Language Pretraining", NeurIPS, 2022 (NUS). [Paper][PyTorch]
- LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
- Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
- VIOLET: "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- SimVTP: "SimVTP: Simple Video Text Pre-training with Masked Autoencoders", arXiv, 2022 (Tencent). [Paper][PyTorch (in construction)]
- VideoCoCa: "Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners", arXiv, 2022 (Google). [Paper]
- i-Code: "i-Code: An Integrative and Composable Multimodal Learning Framework", AAAI, 2023 (Microsoft). [Paper][Code (in construction)]
- TempCLR: "TempCLR: Temporal Alignment Representation with Contrastive Learning", ICLR, 2023 (Columbia). [Paper]
- MELTR: "MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models", CVPR, 2023 (Korea University). [Paper][PyTorch]
- VIOLETv2: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
- SViTT: "SViTT: Temporal Learning of Sparse Video-Text Transformers", CVPR, 2023 (Intel). [Paper][Website]
- TVTS: "Learning Transferable Spatiotemporal Representations from Natural Script Knowledge", CVPR, 2023 (Tencent). [Paper][PyTorch]
- HBI: "Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning", CVPR, 2023 (Peking University). [Paper][Code (in construction)][Website]
- All-in-One: "All in One: Exploring Unified Video-Language Pre-training", CVPR, 2023 (NUS). [Paper][PyTorch]
- VindLU: "VindLU: A Recipe for Effective Video-and-Language Pretraining", CVPR, 2023 (UNC). [Paper][PyTorch]
- Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", CVPR, 2023 (ByteDance). [Paper][PyTorch (in construction)]
- mPLUG-2: "mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video", ICML, 2023 (Alibaba). [Paper][Code (in construction)]
- BUS: "BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization", ICCV, 2023 (Alibaba). [Paper]
- UMT: "Unmasked Teacher: Towards Training-Efficient Video Foundation Models", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- ?: "Long-range Multimodal Pretraining for Movie Understanding", ICCV, 2023 (Adobe). [Paper]
- EgoVLPv2: "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone", ICCV, 2023 (Meta). [Paper][Website]
- STOA-VLP: "STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
- G-ViLM: "Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding", arXiv, 2023 (Google). [Paper]
- VLAB: "VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending", arXiv, 2023 (ByteDance). [Paper]
- i-Code-V2: "i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)]
- TVTSv2: "TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- VFC: "Verbs in Action: Improving verb understanding in video-language models", arXiv, 2023 (Google). [Paper]
- Youku-mPLUG: "Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks", arXiv, 2023 (Alibaba). [Paper]
- VideoGLUE: "VideoGLUE: Video General Understanding Evaluation of Foundation Models", arXiv, 2023 (Google). [Paper]
- InternVid: "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- EVE: "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", arXiv, 2023 (Sun Yat-sen University). [Paper]
- Qwen-VL: "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities", arXiv, 2023 (Alibaba). [Paper][PyTorch]
- BT-Adapter: "One For All: Video Conversation is Feasible Without Video Instruction Tuning", arXiv, 2023 (Tencent). [Paper]
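Many of the video-language pre-training methods above (e.g., VideoCLIP and other CLIP-style dual encoders) optimize a symmetric InfoNCE objective between pooled video features and caption features. The sketch below is a generic illustration under assumed shapes and temperature; the exact pooling and positive/negative sampling differ from paper to paper.

```python
# Generic symmetric InfoNCE objective used (in various forms) by video-text
# contrastive pre-training; an illustrative sketch, not any paper's exact recipe.
import torch
import torch.nn.functional as F

def video_text_nce(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D); diagonal pairs are the positives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```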
- 3D:
- CLIP2: "CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data", CVPR, 2023 (Huawei). [Paper]
- 3D-VLP: "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training", CVPR, 2023 (Sichuan University). [Paper][PyTorch]
- SDFusion: "SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation", CVPR, 2023 (Snap). [Paper][PyTorch][Website]
- 3D-VisTA: "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment", ICCV, 2023 (Beijing Institute for General Artificial Intelligence (BIGAI)). [Paper][PyTorch][Website]
- RegionPLC: "RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding", arXiv, 2023 (HKU). [Paper][Website]
- 3DVLP: "Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding", arXiv, 2023 (Tsinghua). [Paper]
- CLIPXPlore: "CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration", arXiv, 2023 (CUHK). [Paper]
- Point-PEFT: "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- Vision-Audio-Text:
- VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
- VideoCC: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper][Website]
- MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
- VATLM: "VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- CLIP4VLA: "Accommodating Audio Modality in CLIP for Multimodal Processing", AAAI, 2023 (Renmin University of China). [Paper]
- data2vec-2.0: "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language", ICML, 2023 (Meta). [Paper][PyTorch]
- VALOR: "VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset", arXiv, 2023 (CAS). [Paper][PyTorch][Website]
- VAST: "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset", arXiv, 2023 (CAS). [Paper]
- More than 3 modalities:
- Meta-Transformer: "Meta-Transformer: A Unified Framework for Multimodal Learning", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
- UnIVAL: "Unified Model for Image, Video, Audio and Language Tasks", arXiv, 2023 (Sorbonne University, France). [Paper][PyTorch][Website]
- ViT-Lens: "ViT-Lens: Towards Omni-modal Representations", arXiv, 2023 (Tencent). [Paper][PyTorch]
- General:
- Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
- HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
- TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
- VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
- CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
- MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
- TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
- CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
- ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
- MACK: "MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching", NeurIPS, 2022 (CAS). [Paper]
- MLA: "Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval", NeurIPS, 2022 (Renmin University of China). [Paper]
- SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
- LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
- TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
- HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
- ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
- TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
- VLPCook: "Structured Vision-Language Pretraining for Computational Cooking", arXiv, 2022 (Sorbonne University, France). [Paper]
- UniVL-DR: "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval", ICLR, 2023 (Northeastern University, China). [Paper]
- HREM: "Learning Semantic Relationship Among Instances for Image-Text Matching", CVPR, 2023 (USTC). [Paper]
- CHAN: "Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
- ViLEM: "ViLEM: Visual-Language Error Modeling for Image-Text Retrieval", CVPR, 2023 (CAS). [Paper]
- SoftMask: "Multi-Modal Representation Learning with Text-Driven Soft Masks", CVPR, 2023 (SNU). [Paper]
- MetaPer: "Meta-Personalizing Vision-Language Models To Find Named Instances in Video", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
- DivE: "Improving Cross-Modal Retrieval with Set of Diverse Embeddings", CVPR, 2023 (POSTECH). [Paper][Website]
- Pic2Word: "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval", CVPR, 2023 (Google). [Paper][PyTorch]
- ConaCLIP: "ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval", ACL Industry Track, 2023 (Alibaba). [Paper][PyTorch]
- FNE: "Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
- HAT: "Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
- STAIR: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", arXiv, 2023 (Apple). [Paper]
- ChatIR: "Chatting Makes Perfect - Chat-based Image Retrieval", arXiv, 2023 (The Hebrew University of Jerusalem, Israel). [Paper]
- TransAgg: "Zero-shot Composed Text-Image Retrieval", arXiv, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
- Video:
- MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
- AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
- HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
- Frozen: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper][PyTorch][Website][Dataset]
- CLIP4Clip: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", arXiv, 2021 (Microsoft). [Paper][PyTorch] (see the mean-pooling similarity sketch after this list)
- MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
- X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
- MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
- OA-Trans: "Object-aware Video-language Pre-training for Retrieval", CVPR, 2022 (NUS). [Paper][PyTorch]
- BridgeFormer: "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
- CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
- X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
- HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
- TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
- LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
- ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
- MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper][PyTorch]
- VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
- LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
- ?: "A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper]
- ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
- ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
- RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
- MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
- M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
- FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]
- Cross-Modal-Adapter: "Cross-Modal Adapter for Text-Video Retrieval", arXiv, 2022 (Tsinghua University). [Paper][PyTorch (in construction)]
- MAC: "Masked Contrastive Pre-Training for Efficient Video-Text Retrieval", arXiv, 2022 (Alibaba). [Paper]
- CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", ICLR, 2023 (Microsoft). [Paper][Code (in construction)]
- HiREST: "Hierarchical Video-Moment Retrieval and Step-Captioning", CVPR, 2023 (UNC + Meta). [Paper][PyTorch][Website]
- Cap4Video: "Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
- CLIPPING: "CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval", CVPR, 2023 (Huawei). [Paper]
- CNVid-3.5M: "CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
- CelebV-Text: "CelebV-Text: A Large-Scale Facial Text-Video Dataset", CVPR, 2023 (University of Sydney). [Paper][GitHub][Website]
- ReST: "Relational Space-Time Query in Long-Form Videos", CVPR, 2023 (Meta). [Paper]
- NaQ: "NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory", CVPR, 2023 (UT Austin). [Paper][PyTorch][Website]
- ?: "Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval", CVPR, 2023 (Columbia). [Paper][Code (in contruction)]
- VoP: "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval", CVPR, 2023 (Alibaba). [Paper][Code (in construction)][Website]
- SpotEM: "SpotEM: Efficient Video Search for Episodic Memory", ICML, 2023 (UT Austin). [Paper][Website]
- PromptSwitch: "Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval", ICCV, 2023 (University of Adelaide). [Paper][PyTorch (in construction)]
- ?: "Simple Baselines for Interactive Video Retrieval with Questions and Answers", ICCV, 2023 (Princeton). [Paper][Code (in construction)]
- MeVTR: "Multi-event Video-Text Retrieval", ICCV, 2023 (LMU Munich). [Paper][Code (in construction)]
- In-Style: "In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval", ICCV, 2023 (MPI). [Paper][Code (in construction)]
- ReGaDa: "Video-adverb retrieval with compositional adverb-action embeddings", BMVC, 2023 (University of Tübingen, Germany). [Paper][Code (in construction)][Website]
- DiffusionRet: "DiffusionRet: Generative Text-Video Retrieval with Diffusion Model", arXiv, 2023 (Peking University). [Paper]
- TextVR: "A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
- MASCOT: "Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval", arXiv, 2023 (?). [Paper]
- CrossTVR: "Fine-grained Text-Video Retrieval with Frozen Image Encoders", arXiv, 2023 (Alibaba). [Paper]
- TEFAL: "Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment", arXiv, 2023 (Amazon). [Paper]
- TeachCLIP: "TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval", arXiv, 2023 (Renmin University of China). [Paper]
- CoVR: "CoVR: Learning Composed Video Retrieval from Web Video Captions", arXiv, 2023 (Ecole des Ponts ParisTech (ENPC), France). [Paper][PyTorch][Website]
- LanguageBind: "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment", arXiv, 2023 (Peking). [Paper][PyTorch]
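As noted in the CLIP4Clip entry above, the simplest "parameter-free" retrieval variant mean-pools per-frame CLIP features and scores captions by cosine similarity. The following sketch illustrates that idea only; tensor names and shapes are assumptions rather than any official API.

```python
# Illustrative sketch of mean-pooled text-video similarity in the spirit of the
# parameter-free CLIP4Clip variant; frame features are assumed to come from a
# frozen CLIP-like image encoder.
import torch
import torch.nn.functional as F

def text_video_similarity(text_emb, frame_emb):
    """text_emb: (B, D) caption features; frame_emb: (B, T, D) per-frame features."""
    video_emb = F.normalize(frame_emb.mean(dim=1), dim=-1)  # temporal mean pooling
    text_emb = F.normalize(text_emb, dim=-1)
    return text_emb @ video_emb.t()                         # (B, B) caption-to-video scores

# retrieval: rank all videos for each caption
sims = text_video_similarity(torch.randn(4, 512), torch.randn(4, 12, 512))
ranking = sims.argsort(dim=-1, descending=True)             # indices of best-matching videos
```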
- Vision-Audio-Text:
- Multi-SK: "Preserving Modality Structure Improves Multi-Modal Learning", ICCV, 2023 (UCF). [Paper][Code (in construction)]
- Others:
- IRRA: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
- ZS-SBIR: "CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not", CVPR, 2023 (University of Surrey, UK). [Paper][PyTorch]
- ViML: "Language-Guided Music Recommendation for Video via Prompt Analogies", CVPR, 2023 (Adobe). [Paper][Website]
- Auto-ACD: "A Large-scale Dataset for Audio-Language Representation Learning", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)][Website]
- General:
- AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
- ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
- DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
- CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
- Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
- Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
- LDM: "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR, 2022 (LMU Munich). [Paper][PyTorch]
- AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
- StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
- Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
- TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
- CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
- CLIPDraw: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders", NeurIPS, 2022 (Cross Compass, Japan). [Paper][PyTorch][Blog]
- Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", NeurIPS, 2022 (Google). [Paper][Website]
- ?: "Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark", NeurIPSW, 2022 (Boston + MIT + Columbia). [Paper]
- DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
- DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
- ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
- GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
- ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
- Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
- Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
- VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
- PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
- FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
- Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
- UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
- VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
- Lafite2: "Lafite2: Few-shot Text-to-Image Generation", arXiv, 2022 (SUNY, Buffalo). [Paper]
- eDiffi: "eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers", arXiv, 2022 (NVIDIA). [Paper][Website]
- SpaText: "SpaText: Spatio-Textual Representation for Controllable Image Generation", arXiv, 2022 (Meta). [Paper][Website]
- Story-LDM: "Make-A-Story: Visual Memory Conditioned Consistent Story Generation", arXiv, 2022 (UBC + Snap). [Paper]
- Structure-Diffusion: "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis", arXiv, 2022 (UCSB + UC Santa Cruz). [Paper][PyTorch][Website]
- Re-Imagen: "Re-Imagen: Retrieval-Augmented Text-to-Image Generator", ICLR, 2023 (Google). [Paper]
- Prompt-to-Prompt: "Prompt-to-Prompt Image Editing with Cross Attention Control", ICLR, 2023 (Google). [Paper][PyTorch][Website]
- UniD3: "Unified Discrete Diffusion for Simultaneous Vision-Language Generation", ICLR, 2023 (NTU, Singapore). [Paper]
- T2P: "Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation", CVPR, 2023 (Fuxi AI Lab). [Paper]
- GLIGEN: "GLIGEN: Open-Set Grounded Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
- MAGVLT: "MAGVLT: Masked Generative Vision-and-Language Transformer", CVPR, 2023 (Kakao). [Paper]
- ReCo: "ReCo: Region-Controlled Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- GALIP: "GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis", CVPR, 2023 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch]
- DreamBooth: "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR, 2023 (Google). [Paper][GitHub][Website]
- RIATIG: "RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts", CVPR, 2023 (Washington University in St. Louis). [Paper]
- ERNIE-ViLG-2.0: "ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts", CVPR, 2023 (Baidu). [Paper][Website]
- GigaGAN: "Scaling up GANs for Text-to-Image Synthesis", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
- Shifted-Diffusion: "Shifted Diffusion for Text-to-image Generation", CVPR, 2023 (ByteDance). [Paper][PyTorch]
- Specialist-Diffusion: "Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style", CVPR, 2023 (Picsart). [Paper][Website]
- ?: "Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation", CVPR, 2023 (CyberAgent, Japan). [Paper]
- Custom-Diffusion: "Multi-Concept Customization of Text-to-Image Diffusion", CVPR, 2023 (Adobe). [Paper]
- UniDiffuser: "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale", ICML, 2023 (Tsinghua University). [Paper][PyTorch]
- Muse: "Muse: Text-To-Image Generation via Masked Generative Transformers", ICML, 2023 (Google). [Paper][Website]
- RA-CM3: "Retrieval-Augmented Multimodal Language Modeling", ICML, 2023 (Meta). [Paper]
- StyleGAN-T: "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
- VD: "Versatile Diffusion: Text, Images and Variations All in One Diffusion Model", ICCV, 2023 (Oregon). [Paper][PyTorch]
- DiT: "Scalable Diffusion Models with Transformers", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
- E4T: "Designing an Encoder for Fast Personalization of Text-to-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
- ?: "Controlled and Conditional Text to Image Generation with Diffusion Prior", arXiv, 2023 (Adobe). [Paper]
- Lformer: "Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding", arXiv, 2023 (Zhejiang University). [Paper]
- UMM-Diffusion: "Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation", arXiv, 2023 (Peking University). [Paper]
- TIFA: "TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering", arXiv, 2023 (UW). [Paper][Code (in construction)][Website]
- ToMESD: "Token Merging for Fast Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
- layout-guidance: "Training-Free Layout Control with Cross-Attention Guidance", arXiv, 2023 (Oxford). [Paper][PyTorch][Website]
- HRS-Bench: "HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models", arXiv, 2023 (KAUST). [Paper][GitHub][Website]
- SeedSelect: "It is all about where you start: Text-to-image generation with seed selection", arXiv, 2023 (Bar-Ilan University, Israel). [Paper]
- DisenBooth: "DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation", arXiv, 2023 (Tsinghua). [Paper]
- VideoOFA: "VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation", arXiv, 2023 (Meta). [Paper]
- FastComposer: "FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
- LLMScore: "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation", arXiv, 2023 (UCSB). [Paper][PyTorch]
- CoDi: "Any-to-Any Generation via Composable Diffusion", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
- ?: "The CLIP Model is Secretly an Image-to-Prompt Converter", arXiv, 2023 (Xidian University). [Paper]
- PoS-subspaces: "Parts of Speech-Grounded Subspaces in Vision-Language Models", arXiv, 2023 (Queen Mary University of London). [Paper][PyTorch (in construction)][Website]
- VPGen: "Visual Programming for Text-to-Image Generation and Evaluation", arXiv, 2023 (UNC). [Paper][PyTorch][Website]
- BLIP-Diffusion: "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing", arXiv, 2023 (Salesforce). [Paper][Code (in construction)][Website]
- SeeCoder: "Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models", arXiv, 2023 (Picsart). [Paper][PyTorch]
- GILL: "Generating Images with Multimodal Language Models", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
- CAC: "Localized Text-to-Image Generation for Free via Cross Attention Control", arXiv, 2023 (CMU). [Paper]
- CLIPAG: "CLIPAG: Towards Generator-Free Text-to-Image Generation", arXiv, 2023 (Technion, Israel). [Paper]
- PACGen: "Generate Anything Anywhere in Any Scene", arXiv, 2023 (UW Madison). [Paper][Code (in construction)][Website]
- SPAE: "SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs", arXiv, 2023 (Google). [Paper]
- DA-Score: "Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback", arXiv, 2023 (ANU). [Paper][Code (in construction)][Website]
- HyperDreamBooth: "HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models", arXiv, 2023 (Google). [Paper][Website]
- ?: "Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
- GORS: "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation", arXiv, 2023 (HKU). [Paper][Website][PyTorch]
- IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models", arXiv, 2023 (Tencent). [Paper][Website]
- ORES: "ORES: Open-vocabulary Responsible Visual Synthesis", arXiv, 2023 (Microsoft). [Paper]
- CM3Leon: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", arXiv, 2023 (Meta). [Paper]
- DreamLLM: "DreamLLM: Synergistic Multimodal Comprehension and Creation", arXiv, 2023 (Megvii). [Paper][Code (in construction)][Website]
- FreeU: "FreeU: Free Lunch in Diffusion U-Net", arXiv, 2023 (NTU, Singapore). [Paper][Website][Code (in construction)]
- Emu: "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack", arXiv, 2023 (Meta). [Paper]
- PixArt-α: "PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis", arXiv, 2023 (Huawei). [Paper][Website]
- Kosmos-G: "Kosmos-G: Generating Images in Context with Multimodal Large Language Models", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
- AlignProp: "Aligning Text-to-Image Diffusion Models with Reward Backpropagation", arXiv, 2023 (CMU). [Paper][PyTorch][Website]
- Idea2Img: "Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation", arXiv, 2023 (Microsoft). [Paper][Website]
- EasyGen: "Making Multimodal Generation Easier: When Diffusion Models Meet LLMs", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
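Several of the text-to-image diffusion models listed above (e.g., GLIDE, Imagen, and latent-diffusion systems) sample with classifier-free guidance, extrapolating the caption-conditioned noise prediction away from the unconditional one at each denoising step. A hedged sketch of that single step follows; `eps_model`, the embedding arguments, and the default guidance scale are illustrative placeholders, not any specific codebase's API.

```python
# Minimal sketch of one classifier-free guidance step for text-conditioned
# diffusion sampling. `eps_model` and its arguments are placeholders.
import torch

@torch.no_grad()
def guided_noise(eps_model, x_t, t, text_emb, null_emb, guidance_scale=7.5):
    """Blend conditional and unconditional noise predictions for one step."""
    eps_cond = eps_model(x_t, t, text_emb)     # prediction given the caption
    eps_uncond = eps_model(x_t, t, null_emb)   # prediction given an empty prompt
    # guidance_scale = 1.0 recovers plain conditional sampling
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```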
- Video:
- Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
- Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
- ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
- MagicVideo: "MagicVideo: Efficient Video Generation With Latent Diffusion Models", arXiv, 2022 (ByteDance). [Paper][Website]
- CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", ICLR, 2023 (Tsinghua University). [Paper][GitHub (in construction)]
- Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", ICLR, 2023 (Meta). [Paper]
- VideoLDM: "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][Website]
- MMVG: "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation", CVPR, 2023 (Meta). [Paper]
- MM-Diffusion: "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- PYoCo: "Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models", ICCV, 2023 (NVIDIA). [Paper][Website]
- Text2Video-Zero: "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators", ICCV, 2023 (Picsart). [Paper][Code (in construction)]
- Text2Performer: "Text2Performer: Text-Driven Human Video Generation", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
- VideoFactory: "VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation", arXiv, 2023 (Microsoft). [Paper]
- Video-Adapter: "Probabilistic Adaptation of Text-to-Video Models", arXiv, 2023 (DeepMind). [Paper][Website]
- SimDA: "SimDA: Simple Diffusion Adapter for Efficient Video Generation", arXiv, 2023 (Fudan). [Paper][Website]
- LVD: "LLM-grounded Video Diffusion Models", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
- 3D:
- Magic3D: "Magic3D: High-Resolution Text-to-3D Content Creation", CVPR, 2023 (NVIDIA). [Paper][Website]
- CLIP-Sculptor: "CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language", CVPR, 2023 (Autodesk). [Paper][Website]
- Diffusion-SDF: "Diffusion-SDF: Text-to-Shape via Voxelized Diffusion", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
- TAPS3D: "TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision", CVPR, 2023 (Bytedance). [Paper][PyTorch]
- Dream3D: "Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models", CVPR, 2023 (Tencent). [Paper][Website]
- ATT3D: "ATT3D: Amortized Text-To-3D Object Synthesis", arXiv, 2023 (NVIDIA). [Paper][Website]
- InstructP2P: "InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions", arXiv, 2023 (Tencent). [Paper]
- SDS-Complete: "Point-Cloud Completion with Pretrained Text-to-image Diffusion Models", arXiv, 2023 (NVIDIA). [Paper][Website]
- Michelangelo: "Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation", arXiv, 2023 (Tencent). [Paper][Code (in construction)](https://github.com/NeuralCarver/michelangelo)[Website]
- DiffTF: "Large-Vocabulary 3D Diffusion Model with Transformer", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
- Others:
- DiffGesture: "Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation", CVPR, 2023 (HKU). [Paper][PyTorch]
- CondFoleyGen: "Conditional Generation of Audio from Video via Foley Analogies", CVPR, 2023 (UMich). [Paper][PyTorch (in construction)][Website]
- Physics-Diffusion: "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos", CVPR, 2023 (IBM). [Paper][PyTorch][Website]
- RACER: "Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards", CVPR, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
- ReVISE: "ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
- MAV3D: "Text-To-4D Dynamic Scene Generation", ICML, 2023 (Meta). [Paper][Website]
- LORIS: "Long-Term Rhythmic Video Soundtracker", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- NExT-GPT: "NExT-GPT: Any-to-Any Multimodal LLM", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
- CLIP-Adapter: "CLIP-Adapter: Better Vision-Language Models with Feature Adapters", arXiv, 2021 (Shanghai AI Lab). [Paper][PyTorch]
- CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
- ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
- VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
- PerVL: "'This is my unicorn, Fluffy': Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
- OrdinalCLIP: "OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
- BeamCLIP: "Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching", NeurIPS, 2022 (LG). [Paper]
- TPT: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch][Website]
- CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch] (see the learnable-context sketch after this list)
- LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", CVPR, 2023 (Samsung). [Paper][Website]
- VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
- CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
- Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
- PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
- UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
- CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
- PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
- MVLPT: "Multitask Vision-Language Prompt Tuning", arXiv, 2022 (Berkeley). [Paper][PyTorch]
- ?: "Task Bias in Vision-Language Models", arXiv, 2022 (Columbia). [Paper]
- UPL: "Unsupervised Prompt Learning for Vision-Language Models", arXiv, 2022 (Peking). [Paper][PyTorch]
- DeFo: "Learning to Decompose Visual Features with Latent Textual Prompts", ICLR, 2023 (UIUC). [Paper]
- PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", ICLR, 2023 (CMU). [Paper]
- ?: "Visual Classification via Description from Large Language Models", ICLR, 2023 (Columbia). [Paper]
- CSP: "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning", ICLR, 2023 (Brown University). [Paper][PyTorch]
- CaFo: "Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- ?: "Multimodal Prompting with Missing Modalities for Visual Recognition", CVPR, 2023 (NYCU). [Paper][PyTorch][Website]
- DAM-VP: "Diversity-Aware Meta Visual Prompting", CVPR, 2023 (USTC). [Paper][PyTorch]
- ILM-VP: "Understanding and Improving Visual Prompting: A Label-Mapping Perspective", CVPR, 2023 (Michigan State). [Paper][PyTorch]
- KgCoOp: "Visual-Language Prompt Tuning with Knowledge-guided Context Optimization", CVPR, 2023 (CAS). [Paper][PyTorch]
- BlackVIP: "BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning", CVPR, 2023 (University of Seoul). [Paper][PyTorch]
- EXPRES: "Learning Expressive Prompting With Residuals for Vision Transformers", CVPR, 2023 (Amazon). [Paper]
- ?: "Learning to Name Classes for Vision and Language Models", CVPR, 2023 (Huawei). [Paper]
- PMF: "Efficient Multimodal Fusion via Interactive Prompting", CVPR, 2023 (Zhejiang University). [Paper]
- MaPLe: "MaPLe: Multi-modal Prompt Learning", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
- HiPro: "Hierarchical Prompt Learning for Multi-Task Learning", CVPR, 2023 (JD). [Paper]
- DFSP: "Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
- TaI-DP: "Texts as Images in Prompt Tuning for Multi-Label Image Recognition", CVPR, 2023 (Tomorrow Advancing Life (TAL)). [Paper][PyTorch]
- ESPER: "Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning", CVPR, 2023 (Yonsei). [Paper][PyTorch]
- APT: "A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting", CVPR, 2023 (Amazon). [Paper]
- VQT: "Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning", CVPR, 2023 (The Ohio State University (OSU)). [Paper]
- LaBo: "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification", CVPR, 2023 (University of Pennsylvania). [Paper][PyTorch]
- TaskRes: "Task Residual for Tuning Vision-Language Models", CVPR, 2023 (NUS). [Paper][PyTorch]
- POUF: "POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models", ICML, 2023 (UT Austin). [Paper][PyTorch]
- ?: "Improving Visual Prompt Tuning for Self-supervised Vision Transformers", ICML, 2023 (SNU). [Paper][PyTorch]
- ZPE: "A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models", ICML, 2023 (Google). [Paper]
- CMPA: "Deeply Coupled Cross-Modal Prompt Learning", ACL Findings, 2023 (SenseTime). [Paper]
- PromptSRC: "Self-regulating Prompts: Foundational Model Adaptation without Forgetting", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
- SHIP: "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts", ICCV, 2023 (CAS). [Paper]
- PTNL: "Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?", ICCV, 2023 (ByteDance). [Paper]
- E2VPT: "E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning", ICCV, 2023 (Rochester Institute of Technology, NY). [Paper][PyTorch]
- R-AMT: "Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
- DiffTPT: "Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning", ICCV, 2023 (A*STAR). [Paper][PyTorch (in construction)]
- KAPT: "Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models", ICCV, 2023 (Southern University of Science and Technology (SUSTech)). [Paper]
- RPO: "Read-only Prompt Optimization for Vision-Language Few-shot Learning", ICCV, 2023 (Korea University). [Paper][PyTorch]
- LoGoPrompt: "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models", ICCV, 2023 (ShanghaiTech). [Paper][Website]
- DAPT: "Distribution-Aware Prompt Tuning for Vision-Language Models", ICCV, 2023 (Korea University). [Paper][Code (in construction)]
- GOPro: "GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning", BMVC, 2023 (IIT Bombay). [Paper][Code (in construction)]
- ALIGN: "Tuning Multi-mode Token-level Prompt Alignment across Modalities", NeurIPS, 2023 (Xidian University). [Paper][Code (in construction)]
- GraphAdapter: "GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph", NeurIPS, 2023 (NUS). [Paper][Code (in construction)]
- SeMap: "From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need", arXiv, 2023 (CISPA, Germany). [Paper]
- R-Tuning: "R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- VPTM: "Rethinking Visual Prompt Learning as Masked Visual Token Modeling", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- GRAM: "Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models", arXiv, 2023 (Huawei). [Paper]
- PBPrompt: "Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models", arXiv, 2023 (Xidian University). [Paper]
- CTP-TFT: "Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models", arXiv, 2023 (Baidu). [Paper]
- POMP: "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition", arXiv, 2023 (Amazon). [Paper][PyTorch]
- ?: "What does CLIP know about a red circle? Visual prompt engineering for VLMs", arXiv, 2023 (Oxford). [Paper]
- Robust-ProL: "Towards Robust Prompts on Vision-Language Models", arXiv, 2023 (Google). [Paper]
- ProVP: "Progressive Visual Prompt Learning with Contrastive Feature Re-formation", arXiv, 2023 (vivo, China). [Paper]
- ?: "Chain of Thought Prompt Tuning in Vision Language Models", arXiv, 2023 (Peking University). [Paper]
- Instruction-ViT: "Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper]
- VPGTrans: "Transfer Visual Prompt Generator across LLMs", arXiv, 2023 (NUS). [Paper][PyTorch][Website]
- DRPT: "DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
- VCoT: "Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings", arXiv, 2023 (UCSB). [Paper]
- PMPO: "Multi-Prompt with Depth Partitioned Cross-Modal Learning", arXiv, 2023 (CAS). [Paper]
- Aurora: "Mode Approximation Makes Good Vision-Language Prompts", arXiv, 2023 (Peking). [Paper][PyTorch]
- DSD: "Discriminative Diffusion Models as Few-shot Vision and Language Learners", arXiv, 2023 (Google). [Paper]
- PLID: "Prompting Language-Informed Distribution for Compositional Zero-Shot Learning", arXiv, 2023 (Michigan State). [Paper]
- ConES: "ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models", arXiv, 2023 (Sichuan University). [Paper]
- LaFTer: "LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections", arXiv, 2023 (TU Graz, Austria). [Paper]
- ?: "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning", arXiv, 2023 (Brown). [Paper][PyTorch]
- CoPrompt: "Consistency-guided Prompt Learning for Vision-Language Models", arXiv, 2023 (Queen’s University, Canada). [Paper]
- ProTeCt: "ProTeCt: Prompt Tuning for Hierarchical Consistency", arXiv, 2023 (UCSD). [Paper]
- FGVP: "Fine-Grained Visual Prompting", arXiv, 2023 (BAAI). [Paper]
- POP: "POP: Prompt Of Prompts for Continual Learning", arXiv, 2023 (Qualcomm). [Paper]
- GAVIE: "Aligning Large Multi-Modal Model with Robust Instruction Tuning", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
- NPT: "Bridging the Gap: Neural Collapse Inspired Prompt Tuning for Generalization under Class Imbalance", arXiv, 2023 (Zhejiang University). [Paper]
- APT: "Approximated Prompt Tuning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
- CoPL: "Contextual Prompt Learning for Vision-Language Understanding", arXiv, 2023 (Adobe). [Paper]
- CiP: "Image Captions are Natural Prompts for Text-to-Image Models", arXiv, 2023 (The University of Sydney). [Paper]
- UP-DP: "UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models", arXiv, 2023 (Bosch). [Paper]
- DPL: "DPL: Decoupled Prompt Learning for Vision-Language Models", arXiv, 2023 (vivo). [Paper]
- DuAl-PT: "Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment", arXiv, 2023 (ByteDance). [Paper]
- DePT: "DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning", arXiv, 2023 (UCL). [Paper][PyTorch]
- Prompting4Debugging: "Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts", arXiv, 2023 (NYCU). [Paper]
- ?: "Language Models as Black-Box Optimizers for Vision-Language Models", arXiv, 2023 (CMU). [Paper]
- DePT: "DePT: Decoupled Prompt Tuning", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper][PyTorch]
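Many of the prompt-tuning papers above build on the CoOp recipe referenced earlier: a small set of learnable context vectors is prepended to the frozen class-name token embeddings before the CLIP text encoder, and only these vectors are trained. The sketch below shows just that prepending step under assumed dimensions; it is not the official CoOp code.

```python
# Hedged sketch of CoOp-style learnable context vectors. Dimensions, the number
# of context tokens, and the initialization are assumptions; the frozen CLIP
# text and image encoders are omitted here.
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    def __init__(self, n_ctx=16, dim=512):
        super().__init__()
        # [V]_1 ... [V]_n_ctx, shared across classes
        self.ctx = nn.Parameter(0.02 * torch.randn(n_ctx, dim))

    def forward(self, class_token_emb):
        """class_token_emb: (num_classes, L, dim) frozen class-name token embeddings."""
        n_cls = class_token_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # each class prompt = [learned context tokens] + [class-name tokens]
        return torch.cat([ctx, class_token_emb], dim=1)

# only the context vectors receive gradients during few-shot adaptation
prompts = LearnableContext()(torch.randn(10, 8, 512))
```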
- LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
- DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
- StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", ACMMM, 2021 (Baidu). [Paper][Paddle]
- LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
- TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
- TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
- ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
- Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
- I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
- MGDoc: "MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding", EMNLP, 2022 (Adobe). [Paper]
- DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
- DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
- DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
- LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
- MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
- VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
- Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
- TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
- Hi-VT5: "Hierarchical multimodal transformers for Multi-Page DocVQA", arXiv, 2022 (UAB, Spain). [Paper]
- OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]
- PIXEL: "Language Modelling with Pixels", ICLR, 2023 (University of Copenhagen, Denmark). [Paper]
- Spotlight: "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus", ICLR, 2023 (Google). [Paper]
- MaskDoc: "Masked Visual-Textual Prediction for Document Image Representation Pretraining", ICLR, 2023 (Baidu). [Paper]
- StrucTexTv2: "StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training", ICLR, 2023 (Baidu). [Paper][Paddle]
- FlexDM: "Towards Flexible Multi-modal Document Models", CVPR, 2023 (CyberAgent, Japan). [Paper][Tensorflow][Website]
- MUI: "Mobile User Interface Element Detection Via Adaptively Prompt Tuning", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
- UDOP: "Unifying Vision, Text, and Layout for Universal Document Processing", CVPR, 2023 (Microsoft). [Paper][PyTorch]
- M6Doc: "M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis", CVPR, 2023 (South China University of Technology). [Paper][GitHub]
- VGT: "Vision Grid Transformer for Document Layout Analysis", ICCV, 2023 (Alibaba). [Paper][PyTorch]
- SeRum: "Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration", ICCV, 2023 (Tencent). [Paper]
- DocTr: "DocTr: Document Transformer for Structured Information Extraction in Documents", ICCV, 2023 (Amazon). [Paper]
- FormNetV2: "FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction", ACL, 2023 (Google). [Paper]
- mmc4: "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text", arXiv, 2023 (AI2). [Paper][GitHub (in construction)]
- DUBLIN: "DUBLIN - Document Understanding By Language-Image Network", arXiv, 2023 (Microsoft). [Paper]
- DocFormerv2: "DocFormerv2: Local Features for Document Understanding", arXiv, 2023 (Amazon). [Paper]
- DocumentCLIP: "DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents", arXiv, 2023 (Adobe). [Paper][PyTorch]
- Kosmos-2.5: "Kosmos-2.5: A Multimodal Literate Model", arXiv, 2023 (Microsoft). [Paper]
- UReader: "UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model", arXiv, 2023 (Alibaba). [Paper]
- Transfer Learning/Adaptation/Distillation:
- FLYP: "Finetune like you pretrain: Improved finetuning of zero-shot vision models", CVPR, 2023 (CMU). [Paper][PyTorch]
- Pi-Tuning: "Pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation", ICML, 2023 (HKU). [Paper][Code (in construction)]
- OCRA: "Cross-Modal Fine-Tuning: Align then Refine", ICML, 2023 (CMU + HP). [Paper][PyTorch]
- TeS: "Improved Visual Fine-tuning with Natural Language Supervision", arXiv, 2023 (Alibaba). [Paper]
- Paxion: "Paxion: Patching Action Knowledge in Video-Language Foundation Models", arXiv, 2023 (UIUC). [Paper][PyTorch]
- RLCF: "Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
- LMAT: "Can Large Pre-trained Models Help Vision Models on Perception Tasks?", arXiv, 2023 (Huawei). [Paper][Website (in construction)]
- TaCA: "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- ProbVLM: "ProbVLM: Probabilistic Adapter for Frozen Vison-Language Models", arXiv, 2023 (University of Tubingen, Germany). [Paper]
- CLIP-KD: "CLIP-KD: An Empirical Study of Distilling CLIP Models", arXiv, 2023 (CAS). [Paper][Code (in construction)]
- Zero-Shot:
- CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (UW). [Paper][PyTorch]
- SMs: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 (Google). [Paper][GitHub][Website]
- iCLIP: "iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition", CVPR, 2023 (Microsoft). [Paper]
- DiffDis: "DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability", ICCV, 2023 (Huawei). [Paper]
- V-GLOSS: "Visually-Grounded Descriptions Improve Zero-Shot Image Classification", arXiv, 2023 (University of Alberta, Canada). [Paper]
- ?: "Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness", arXiv, 2023 (Amazon). [Paper]
- UniFine: "UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding", arXiv, 2023 (Columbia). [Paper][Code (in construction)]
- Cheetah: "Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions", arXiv, 2023 (Zhejiang). [Paper]
- X-Shot:
- Tip-Adapter: "Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
- VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
- ComCLIP: "ComCLIP: Training-Free Compositional Image and Text Matching", arXiv, 2022 (UC Santa Cruz). [Paper]
- TCT: "Efficient Zero-shot Visual Search via Target and Context-aware Transformer", arXiv, 2022 (Baylor College of Medicine, TX). [Paper]
- ?: "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", ICLR, 2023 (University of Amsterdam). [Paper]
- ?: "Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models", CVPR, 2023 (CMU). [Paper]
- SADA: "Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment", CVPR, 2023 (Huawei). [Paper][PyTorch]
- APE: "Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- LFA: "Black Box Few-Shot Adaptation for Vision-Language models", arXiv, 2023 (Samsung). [Paper]
- ?: "Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime", arXiv, 2023 (DeepMind). [Paper]
- Proto-CLIP: "Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning", arXiv, 2023 (UT Dallas). [Paper]
- NtUA: "Noise-Tolerant Unsupervised Adapter for Vision-Language Models", arXiv, 2023 (MBZUAI). [Paper]
- SeCAt: "Small Visual Language Models can also be Open-Ended Few-Shot Learners", arXiv, 2023 (UvA). [Paper]
- Referring Image Segmentation:
- VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
- CRIS: "CRIS: CLIP-Driven Referring Image Segmentation", CVPR, 2022 (University of Sydney). [Paper]
- LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
- ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
- ReCLIP: "ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension", ACL, 2022 (AI2). [Paper]
- TSEG: "Weakly-supervised segmentation of referring expressions", arXiv, 2022 (INRIA). [Paper]
- ZS-RIS: "Zero-shot Referring Image Segmentation with Global-Local Context Features", CVPR, 2023 (Gwangju Institute of Science and Technology (GIST)). [Paper][PyTorch]
- PolyFormer: "PolyFormer: Referring Image Segmentation as Sequential Polygon Generation", CVPR, 2023 (Amazon). [Paper][Website]
- MCRES: "Meta Compositional Referring Expression Segmentation", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
- ReLA: "GRES: Generalized Referring Expression Segmentation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- CGFormer: "Contrastive Grouping With Transformer for Referring Image Segmentation", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
- CCTF: "Learning To Segment Every Referring Object Point by Point", CVPR, 2023 (JD). [Paper][Code (in construction)]
- ETRIS: "Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
- DMMI: "Beyond One-to-One: Rethinking the Referring Image Segmentation", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- TRIS: "Referring Image Segmentation Using Text Supervision", ICCV, 2023 (CUHK). [Paper][Code (in construction)]
- SnG: "Shatter and Gather: Learning Referring Image Segmentation with Text Supervision", ICCV, 2023 (POSTECH). [Paper]
- VLT: "VLT: Vision-Language Transformer and Query Generation for Referring Segmentation", TPAMI, 2023 (NTU, Singapore). [Paper]
- IREG: "Whether you can locate or not? Interactive Referring Expression Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
- R-RIS: "Towards Robust Referring Image Segmentation", arXiv, 2023 (Peking). [Paper][Code (in construction)][Website]
- PVD: "Parallel Vertex Diffusion for Unified Visual Grounding", arXiv, 2023 (Peking University). [Paper]
- MMNet: "MMNet: Multi-Mask Network for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
- LGFormer: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, 2023 (Alibaba). [Paper]
- RISCLIP: "RISCLIP: Referring Image Segmentation Framework using CLIP", arXiv, 2023 (POSTECH). [Paper]
- EAVL: "EAVL: Explicitly Align Vision and Language for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
- Ref-Diff: "Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models", arXiv, 2023 (Harbin Institute of Technology). [Paper][Code (in construction)]
- DuMoGa: "Towards Complex-query Referring Image Segmentation: A Novel Benchmark", arXiv, 2023 (NUS). [Paper]
- Referring Video Segmentation:
- ReferFormer: "Language as Queries for Referring Video Object Segmentation", CVPR, 2022 (HKU). [Paper][PyTorch]
- MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
- MANet: "Multi-Attention Network for Compressed Video Referring Object Segmentation", ACMMM, 2022 (CAS). [Paper][PyTorch]
- R2VOS: "Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus", ICCV, 2023 (Microsoft). [Paper][PyTorch][Website]
- OnlineRefer: "OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation", ICCV, 2023 (Megvii). [Paper][PyTorch]
- SgMg: "Spectrum-guided Multi-granularity Referring Video Object Segmentation", ICCV, 2023 (The University of Western Australia). [Paper][PyTorch]
- MeViS: "MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- CMA: "Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples", ICCV, 2023 (SUSTech). [Paper][PyTorch]
- TempCD: "Temporal Collection and Distribution for Referring Video Object Segmentation", ICCV, 2023 (ShanghaiTech). [Paper][Website]
- UniRef: "Segment Every Reference Object in Spatial and Temporal Spaces", ICCV, 2023 (HKU). [Paper]
- HTML: "HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation", ICCV, 2023 (University of Technology Sydney, UTS). [Paper][Website]
- SOC: "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation", NeurIPS, 2023 (Tsinghua). [Paper][Code (in construction)]
- Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", TPAMI, 2023 (Zhejiang). [Paper][PyTorch]
- LoSh: "LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation", arXiv, 2023 (King’s College London). [Paper]
- RefSAM: "RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation", arXiv, 2023 (National University of Defense Technology, China). [Paper][Code (in construction)]
- IFIRVOS: "Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation", arXiv, 2023 (Wuhan University). [Paper]
- LGCFS: "Learning Referring Video Object Segmentation from Weak Annotation", arXiv, 2023 (Shanghai AI Lab). [Paper]
- EPCFormer: "EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation", arXiv, 2023 (Hunan University). [Paper][Code (in construction)]
- FTEA: "Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation", arXiv, 2023 (Hangzhou Dianzi University). [Paper]
- Referring 3D Segmentation:
- Tracking:
- ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
- TransRMOT: "Referring Multi-Object Tracking", CVPR, 2023 (Megvii). [Paper][PyTorch][Website]
- ModaMixer: "Divert More Attention to Vision-Language Object Tracking", arXiv, 2023 (Beijing Jiaotong University). [Paper][PyTorch]
- Analysis:
- MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
- ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
- VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
- ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
- VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
- ?: "Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding", CVPR, 2023 (Tel Aviv). [Paper][PyTorch][Website]
- Why-Prompt: "Doubly Right Object Recognition: A Why Prompt for Visual Rationales", CVPR, 2023 (Columbia). [Paper]
- CREPE: "CREPE: Can Vision-Language Foundation Models Reason Compositionally?", CVPR, 2023 (Stanford). [Paper]
- ZOOM: "Zero-shot Model Diagnosis", CVPR, 2023 (CMU). [Paper]
- ?: "On the Generalization of Multi-modal Contrastive Learning", ICML, 2023 (Peking). [Paper][PyTorch]
- ?: "Learning Concise and Descriptive Attributes for Visual Recognition", ICCV, 2023 (UCSD). [Paper]
- ?: "Interpreting CLIP's Image Representation via Text-Based Decomposition", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
- Speaker Localization:
- ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
- Multi-task:
- UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
- Pix2Seq: "A Unified Sequence Interface for Vision Tasks", NeurIPS, 2022 (Google). [Paper]
- LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
- Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", ICLR, 2023 (AI2). [Paper][JAX][Website]
- ImageBind: "ImageBind: One Embedding Space To Bind Them All", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
- EgoT2: "Egocentric Video Task Translation", CVPR, 2023 (Meta). [Paper][Website]
- VTAGML: "Vision Transformer Adapters for Generalizable Multitask Learning", ICCV, 2023 (EPFL). [Paper][Website]
- CoCoCon: "Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models", arXiv, 2023 (AI2). [Paper][PyTorch][Website]
- VisionLLM: "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- ONE-PEACE: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities", arXiv, 2023 (Alibaba). [Paper][PyTorch (in construction)]
- VideoLLM: "VideoLLM: Modeling Video Sequence with Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
- i-Code-Studio: "i-Code Studio: A Configurable and Composable Framework for Integrative AI", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
- Tag2Text: "Tag2Text: Guiding Vision-Language Model via Image Tagging", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
- RAM: "Recognize Anything: A Strong Image Tagging Model", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
- InstructDiffusion: "InstructDiffusion: A Generalist Modeling Interface for Vision Tasks", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
- InstructCV: "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists", arXiv, 2023 (Peking + Berkeley). [Paper][PyTorch]
- Language-based Video Editing:
- M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
- Video-P2P: "Video-P2P: Video Editing with Cross-attention Control", arXiv, 2023 (CUHK). [Paper][Website]
- FateZero: "FateZero: Fusing Attentions for Zero-shot Text-based Video Editing", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
- Make-A-Protagonist: "Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts", arXiv, 2023 (Huawei). [Paper][PyTorch][Website]
- Video Summarization:
- GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
- QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
- HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
- ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
- IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
- A2Summ: "Align and Attend: Multimodal Summarization with Dual Contrastive Losses", CVPR, 2023 (Adobe). [Paper][Code (in construction)][Website]
- QD-DETR: "Query-Dependent Video Representation for Moment Retrieval and Highlight Detection", CVPR, 2023 (Sungkyunkwan University, Korea). [Paper][PyTorch]
- A2Summ: "Align and Attend: Multimodal Summarization with Dual Contrastive Losses", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
- CLC: "Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
- VideoXum: "VideoXum: Cross-modal Visual and Textural Summarization of Videos", arXiv, 2023 (OPPO). [Paper][Website]
- MH-DETR: "MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer", arXiv, 2023 (Nanjing University). [Paper]
- VisionaryVid: "Joint Moment Retrieval and Highlight Detection Via Natural Language Queries", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
- Robotics:
- CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
- TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
- VLMbench: "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation", NeurIPS (Datasets and Benchmarks), 2022 (UC Santa Cruz). [Paper][Pytorch][Website]
- Surgical-VQLA: "Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery", ICRA, 2023 (CUHK). [Paper][PyTorch]
- ?: "Distilling Internet-Scale Vision-Language Models into Embodied Agents", ICML, 2023 (DeepMind). [Paper]
- LIV: "LIV: Language-Image Representations and Rewards for Robotic Control", ICML, 2023 (UPenn). [Paper][PyTorch][Website]
- PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 (Google). [Paper][Website]
- VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
- GVCCI: "GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation", IROS, 2023 (SNU, Korea). [Paper]
- LACO: "Language-Conditioned Path Planning", CoRL, 2023 (Berkeley). [Paper][Code (in construction)][Website]
- Grounded-Decoding: "Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control", arXiv, 2023 (Google). [Paper][Website]
- MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", arXiv, 2023 (Google). [Paper][Website]
- ?: "Vision-Language Models as Success Detectors", arXiv, 2023 (DeepMind). [Paper]
- VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", arXiv, 2023 (Meta). [Paper][Website]
- HomeRobot: "HomeRobot: Open-Vocabulary Mobile Manipulation", arXiv, 2023 (Georgia Tech + Meta). [Paper][PyTorch][Website]
- TaPA: "Embodied Task Planning with Large Language Models", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch][Website]
- VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", arXiv, 2023 (Stanford). [Paper][Website]
- RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv, 2023 (DeepMind). [Paper][Website]
- Multi-modal Fusion:
- MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
- IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
- PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
- TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
- SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
- ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
- CDDFuse: "CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion", CVPR, 2023 (ETHZ). [Paper][PyTorch]
- Human Interaction:
- Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
- 3D:
- 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
- EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
- PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
- VL-SAT: "VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud", CVPR, 2023 (Beihang University). [Paper][PyTorch]
- LERF: "LERF: Language Embedded Radiance Fields", ICCV, 2023 (Berkeley). [Paper][Website]
- ConceptFusion: "ConceptFusion: Open-set Multimodal 3D Mapping", arXiv, 2023 (MIT). [Paper][Website]
- CG3D: "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition", arXiv, 2023 (JHU). [Paper][PyTorch][Website]
- DiffCLIP: "DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification", arXiv, 2023 (Beijing Institute of Technology). [Paper]
- LLM-Grounder: "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent", arXiv, 2023 (UMich). [Paper][PyTorch][Website]
- 3D Scene Understanding:
- OpenScene: "OpenScene: 3D Scene Understanding with Open Vocabularies", CVPR, 2023 (Google). [Paper][PyTorch][Website]
- PartSLIP: "PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models", CVPR, 2023 (Qualcomm). [Paper]
- CLIP2Scene: "CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
- 3D-Highlighter: "3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions", CVPR, 2023 (University of Chicago). [Paper][PyTorch][Website]
- OVSG: "Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs", CoRL, 2023 (Rutgers). [Paper][Code (in construction)]
- CLIP-FO3D: "CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP", arXiv, 2023 (Tsinghua University). [Paper]
- 3D-OVS: "3D Open-vocabulary Segmentation with Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)]
- OVO: "OVO: Open-Vocabulary Occupancy", arXiv, 2023 (Fudan). [Paper]
- SAM3D: "SAM3D: Segment Anything in 3D Scenes", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- Seal: "Segment Any Point Cloud Sequences by Distilling Vision Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch (in construction)]
- OpenMask3D: "OpenMask3D: Open-Vocabulary 3D Instance Segmentation", arXiv, 2023 (ETHZ). [Paper][Website (in construction)]
- Lowis3D: "Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding", arXiv, 2023 (HKU). [Paper]
- OpenIns3D: "OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation", arXiv, 2023 (Cambridge). [Paper][Website]
- ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", arXiv, 2023 (University of Toronto + Universite de Montreal). [Paper][PyTorch][Website]
- Speech Recognition:
- AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
- ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
- AVFormer: "AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR", CVPR, 2023 (Google). [Paper]
- AV-RelScore: "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring", CVPR, 2023 (KAIST). [Paper][PyTorch]
- SynthVSR: "SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision", CVPR, 2023 (Meta). [Paper]
- Emotion Recognition:
- ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
- MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
- DMD: "Decoupled Multimodal Distilling for Emotion Recognition", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
- Sound Separation:
- VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
- iQuery: "iQuery: Instruments as Queries for Audio-Visual Sound Separation", CVPR, 2023 (UCSD). [Paper][Code (in construction)]
- VAST: "Language-Guided Audio-Visual Source Separation via Trimodal Consistency", CVPR, 2023 (Boston University). [Paper][Website]
- AVIN: "Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization", ACMMM, 2023 (Northwestern Polytechnical University). [Paper][Code (in construction)]
- GAVS: "Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer", arXiv, 2023 (Renmin University of China). [Paper]
- Audio-Visual:
- AV-HuBERT: "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction", ICLR, 2022 (Meta). [Paper][PyTorch]
- AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
- TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
- AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
- TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
- ANGIE: "Audio-Driven Co-Speech Gesture Video Generation", NeurIPS, 2022 (CUHK). [Paper][Website]
- MGN: "Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing", NeurIPS, 2022 (CMU + UT Austin). [Paper][PyTorch]
- FS-RIR: "Few-Shot Audio-Visual Learning of Environment Acoustics", NeurIPS, 2022 (UT Austin). [Paper][Website]
- u-HuBERT: "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality", NeurIPS, 2022 (Meta). [Paper]
- PC-VAE: "Multimodal Transformer for Parallel Concatenated Variational Autoencoders", NeurIPSW, 2022 (USC). [Paper]
- AV-CAT: "Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers", SIGGRAPH Asia, 2022 (Tokyo Institute of Technology + Baidu). [Paper][Website]
- Audiovisual-MAE: "Audiovisual Masked Autoencoders", arXiv, 2022 (Google). [Paper]
- MTD: "Multimodal Transformer Distillation for Audio-Visual Synchronization", arXiv, 2022 (NTU). [Paper]
- AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
- CLIPSep: "CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos", ICLR, 2023 (Sony). [Paper]
- CAV-MAE: "Contrastive Audio-Visual Masked Autoencoder", ICLR, 2023 (MIT + IBM). [Paper]
- UnAV: "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline", CVPR, 2023 (Southern University of Science and Technology). [Paper][PyTorch][Website]
- LAVISH: "Vision Transformers are Parameter-Efficient Audio-Visual Learners", CVPR, 2023 (UNC). [Paper][Pytorch][Website]
- OneAVM: "A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition", ICML, 2023 (CMU + UW Madison). [Paper][Code (in construction)]
- AdVerb: "AdVerb: Visually Guided Audio Dereverberation", ICCV, 2023 (Maryland). [Paper][Website]
- CIGN: "Class-Incremental Grouping Network for Continual Audio-Visual Learning", ICCV, 2023 (CMU). [Paper][PyTorch]
- MAViL: "MAViL: Masked Audio-Video Learners", NeurIPS, 2023 (Meta). [Paper][Code (in construction)]
- GestureDiffuCLIP: "GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents", arXiv, 2023 (Peking University). [Paper]
- MMViT: "MMViT: Multiscale Multiview Vision Transformers", arXiv, 2023 (Meta). [Paper]
- ?: "Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos" arXiv, 2023 (Meta). [Paper]
- Audio-Visual Localization/Segmentation:
- AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
- AV-SAM: "AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation", arXiv, 2023 (CMU + UT Dallas). [Paper]
- AUSS: "Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation", arXiv, 2023 (Fudan). [Paper]
- AuTR: "Annotation-free Audio-Visual Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
- AVSegFormer: "AVSegFormer: Audio-Visual Segmentation with Transformer", arXiv, 2023 (Nanjing University). [Paper][PyTorch]
- SQD: "Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition", arXiv, 2023 (CMU). [Paper]
- DiffMAViL: "Diffusion Models as Masked Audio-Video Learners", arXiv, 2023 (Apple). [Paper]
- Audio Description:
- AutoAD: "AutoAD: Movie Description in Context", CVPR, 2023 (Oxford). [Paper][PyTorch (in construction)][Website][Website]
- AutoAD-II: "AutoAD II: The Sequel - Who, When, and What in Movie Audio Description", ICCV, 2023 (Oxford). [Paper][PyTorch (in construction)]
- Sound Localization:
- TURN: "Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch (in construction)]
- AVGN: "Audio-Visual Grouping Network for Sound Localization from Mixtures", CVPR, 2023 (CMU). [Paper][PyTorch]
- Sentiment Analysis:
- Named Entity Recognition:
- FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
- Localization via Embodied Dialog:
- LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]
- Object Captioning:
- Conversation:
- VisProg: "Visual Programming: Compositional visual reasoning without training", CVPR, 2023 (AI2). [Paper][PyTorch][Website]
- LaVIN: "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models", NeurIPS, 2023 (Xiamen University). [Paper][PyTorch][Website]
- Visual-ChatGPT: "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models", arXiv, 2023 (Microsoft). [Paper]
- MM-REACT: "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action", arXiv, 2023 (Microsoft). [Paper][Code][Website]
- Video-ChatCaptioner: "Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions", arXiv, 2023 (KAUST). [Paper][PyTorch]
- Chameleon: "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models", arXiv, 2023 (UCLA + Microsoft). [Paper][PyTorch][Website]
- MiniGPT-4: "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv, 2023 (KAUST). [Paper][PyTorch][Website]
- ChatVideo: "ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System", arXiv, 2023 (Fudan). [Paper][Website]
- LLaMA-Adapter: "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- LLaMA-Adapter-V2: "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- Otter: "Otter: A Multi-Modal Model with In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
- LMEye: "LMEye: An Interactive Perception Network for Large Language Models", arXiv, 2023 (Meituan). [Paper]
- MultiModal-GPT: "MultiModal-GPT: A Vision and Language Model for Dialogue with Humans", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- InternChat: "InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- VideoChat: "VideoChat: Chat-Centric Video Understanding", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- InstructBLIP: "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", arXiv, 2023 (Salesforce). [Paper][PyTorch]
- ArtGPT-4: "ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4", arXiv, 2023 (Anhui Polytechnic University). [Paper][PyTorch]
- EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", arXiv, 2023 (HKU). [Paper][PyTorch (in construction)][Website]
- PandaGPT: "PandaGPT: One Model To Instruction-Follow Them All", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
- Video-LLaMA: "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding", arXiv, 2023 (Alibaba). [Paper][PyTorch]
- MIMIC-IT: "MIMIC-IT: Multi-Modal In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
- Video-ChatGPT: "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
- LAMM: "LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark", arXiv, 2023 (Shanghai AI Lab). [Paper]
- ?: "Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models", arXiv, 2023 (Huawei). [Paper]
- AssistGPT: "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
- Macaw-LLM: "Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration", arXiv, 2023 (Tencent). [Paper][PyTorch]
- Shikra: "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic", arXiv, 2023 (SenseTime). [Paper][Code (in construction)]
- LLaVAR: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding", arXiv, 2023 (Stanford). [Paper][PyTorch][Website]
- Polite-Flamingo: "Visual Instruction Tuning with Polite Flamingo", arXiv, 2023 (Xiaobing.AI). [Paper]
- Lynx: "What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?", arXiv, 2023 (ByteDance). [Paper][Website]
- GPT4RoI: "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- SVIT: "SVIT: Scaling up Visual Instruction Tuning", arXiv, 2023 (BAAI). [Paper]
- AmadeusGPT: "AmadeusGPT: a natural language interface for interactive animal behavioral analysis", arXiv, 2023 (EPFL). [Paper][Code (in construction)]
- ChatSpot: "ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning", arXiv, 2023 (Megvii). [Paper][Demo]
- 3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", arXiv, 2023 (UCLA). [Paper][PyTorch (in construction)][Website]
- ?: "How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges", arXiv, 2023 (ETHZ). [Paper][GitHub (in construction)]
- MovieChat: "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
- AntGPT: "AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?", arXiv, 2023 (Brown). [Paper][Website]
- ?: "Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models", arXiv, 2023 (Google). [Paper]
- MM-Vet: "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities", arXiv, 2023 (Microsoft). [Paper][Code]
- Chat-3D: "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
- LLaVA: "Visual Instruction Tuning", arXiv, 2023 (UW-Madison). [Paper][PyTorch][Website]
- StableLLaVA: "StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- PVIT: "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
- PointLLM: "PointLLM: Empowering Large Language Models to Understand Point Clouds", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
- Point-Bind: "Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following", arXiv, 2023 (CUHK). [Paper][PyTorch]
- ImageBind-LLM: "ImageBind-LLM: Multi-modality Instruction Tuning", arXiv, 2023 (Shanghai AI Lab). [Paper]
- ?: "An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models", arXiv, 2023 (Microsoft). [Paper][GitHub]
- InternLM-XComposer: "InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
- LLaVA-RLHF: "Aligning Large Multimodal Models with Factually Augmented RLHF", arXiv, 2023 (Berkeley + CMU + UIUC). [Paper][Code (in construction)][Website]
- AnyMAL: "AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model", arXiv, 2023 (Meta). [Paper]
- Muffin: "Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants", arXiv, 2023 (Tsinghua). [Paper]
- Pink: "Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs", arXiv, 2023 (Ant). [Paper][Code (in construction)]
- LLaVA-1.5: "Improved Baselines with Visual Instruction Tuning", arXiv, 2023 (UW Madison). [Paper][PyTorch][Website]
- MiniGPT-5: "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
- Ferret: "Ferret: Refer and Ground Anything Anywhere at Any Granularity", arXiv, 2023 (Apple). [Paper][Code (in construction)]
- Visual Reasoning:
- BDC-Adapter: "BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning", BMVC, 2023 (SUSTech). [Paper]
- RPT: "Fine-Grained Regional Prompt Tuning for Visual Abductive Reasoning", arXiv, 2023 (A*STAR). [Paper]
- LRR: "Look, Remember and Reason: Visual Reasoning with Grounded Rationales", arXiv, 2023 (Qualcomm). [Paper]
- SDS-CLIP: "Augmenting CLIP with Improved Visio-Linguistic Reasoning", arXiv, 2023 (Maryland). [Paper]
- ?: "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models", arXiv, 2023 (George Mason University). [Paper]
- ViCor: "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models", arXiv, 2023 (UC Santa Cruz). [Paper]
- Tracking:
- Scene Graph:
- CaCao: "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World", arXiv, 2023 (Zhejiang University). [Paper]
- Egocentric Video:
- Dance Generation:
- TM2D: "TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
- Conceptual Understanding:
- ?: "Text-To-Concept (and Back) via Cross-Model Alignment", ICML, 2023 (Maryland). [Paper]
- ?: "Probing Conceptual Understanding of Large Visual-Language Models", arXiv, 2023 (UCF + SRI). [Paper]
- EAC: "Explain Any Concept: Segment Anything Meets Concept-Based Explanation", arXiv, 2023 (HKUST). [Paper]
- Model Merging:
- Visual Word Sense Disambiguation (VWSD):
- CADG: "Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information", ACL, 2023 (UMass). [Paper]
- Object Hallucination:
- POPE: "Evaluating Object Hallucination in Large Vision-Language Models", arXiv, 2023 (Renmin University of China). [Paper][Code (in construction)]
- Social Interaction:
- HIINT: "HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer", arXiv, 2023 (MIT). [Paper]
- Evaluation:
- Perception-Test: "Perception Test: A Diagnostic Benchmark for Multimodal Video Models", arXiv, 2023 (DeepMind). [Paper][GitHub]
- VLM-Probing: "Scalable Performance Analysis for Vision-Language Models", Joint Conference on Lexical and Computational Semantics (*SEM), 2023 (UMich). [Paper][PyTorch]
- VisualGPTScore: "VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
- LVLM-eHub: "LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
- VisoGender: "VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution", arXiv, 2023 (Oxford). [Paper][PyTorch]
- MME: "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
- MMBench: "MMBench: Is Your Multi-modal Model an All-around Player?", arXiv, 2023 (Shanghai AI Lab). [Paper][Website]
- Tiny-LVLM-eHub: "Tiny LVLM-eHub: Early Multimodal Experiments with Bard", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
- VisIT-Bench: "VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use", arXiv, 2023 (UW). [Paper][Website]
- MODE: "An Examination of the Compositionality of Large Generative Vision-Language Models", arXiv, 2023 (HKUST). [Paper]
- TouchStone: "TouchStone: Evaluating Vision-Language Models by Language Models", arXiv, 2023 (Alibaba). [Paper]
- Q-Bench: "Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision", arXiv, 2023 (NTU, Singapore). [Paper]
- PCA-EVAL: "Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond", arXiv, 2023 (Peking). [Paper][Code (in construction)]
- ReForm-Eval: "ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks", arXiv, 2023 (Fudan). [Paper]
- Robustness:
- Hierarchy-CLIP: "Improving Zero-shot Generalization and Robustness of Multi-modal Models", CVPR, 2023 (Google). [Paper][JAX][Website]
- ?: "Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning", ICML, 2023 (UCLA). [Paper]
- SGA: "Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models", ICCV, 2023 (Southern University of Science and Technology). [Paper]
- VLAttack: "VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models", NeurIPS, 2023 (Pennsylvania State University). [Paper]
- AttackVLM: "On Evaluating Adversarial Robustness of Large Vision-Language Models", arXiv, 2023 (Singapore University of Technology and Design (SUTD)). [Paper][PyTorch (in construction)]
- Compositional Reasoning:
- Vocabulary-free Image Classification (VIC):
- Retrieval-Augmented Methods:
- ?: "Improving Image Recognition by Retrieving from Web-Scale Image-Text Data", CVPR, 2023 (Google). [Paper]
- NeRF:
- NeRDi: "NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors", CVPR, 2023 (Waymo). [Paper]
- Model Selection:
- LOVM: "LOVM: Language-Only Vision Model Selection", arXiv, 2023 (Stanford). [Paper]
- Multimodal Interaction:
- ?: "Learning Unseen Modality Interaction", arXiv, 2023 (University of Amsterdam). [Paper]
- Multimodal Translation:
- Noisy label detection:
- VDC: "VDC: Versatile Data Cleanser for Detecting Dirty Samples via Visual-Linguistic Inconsistency", arXiv, 2023 (CUHK). [Paper]
- Model Compression: