(back to README.md and README_2.md for other categories)

Overview


Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

Multi-Modality

Visual Captioning

  • General:
    • SAT: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015. [paper]
    • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
    • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
    • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
    • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
    • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
    • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
    • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
    • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
    • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
    • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
    • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
    • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning", CVPR, 2022 (Georgia Tech). [Paper][PyTorch]
    • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
    • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
    • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
    • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
    • GRIT: "GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
    • ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
    • UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
    • TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
    • DML: "Learning Distinct and Representative Modes for Image Captioning", NeurIPS, 2022 (University of Adelaide, Australia). [Paper]
    • P2C: "Paraphrasing Is All You Need for Novel Object Captioning", NeurIPS, 2022 (NTU + CMU). [Paper]
    • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", NeurIPS, 2022 (Microsoft). [Paper]
    • CapDec: "Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP, 2022 (Tel Aviv). [Paper][PyTorch]
    • ?: "Focus! Relevant and Sufficient Context Selection for News Image Captioning", EMNLP Findings, 2022 (UC Davis). [Paper]
    • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
    • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
    • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
    • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
    • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]
    • CLM: "Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment", arXiv, 2022 (CAS). [Paper]
    • PromptCap: "PromptCap: Prompt-Guided Task-Aware Image Captioning", arXiv, 2022 (UW). [Paper]
    • PTSN: "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • DDCap: "Exploring Discrete Diffusion Models for Image Captioning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • ARIC: "Aesthetically Relevant Image Captioning", AAAI, 2023 (Shenzhen University). [Paper][Code (in construction)]
    • UAIC: "Uncertainty-Aware Image Captioning", AAAI, 2023 (Meituan). [Paper]
    • LiMBeR: "Linearly Mapping from Image to Text Space", ICLR, 2023 (Brown University). [Paper]
    • DiscriTune: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
    • LIBRA: "Model-Agnostic Gender Debiased Image Captioning", CVPR, 2023 (Osaka University). [Paper]
    • A-CAP: "A-CAP: Anticipation Captioning with Commonsense Knowledge", CVPR, 2023 (The University of Tokyo). [Paper]
    • HAAV: "HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning", CVPR, 2023 (Georgia Tech). [Paper][Website]
    • ?: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
    • PAC-S: "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation", CVPR, 2023 (UniMoRE, Italy). [Paper][PyTorch]
    • SCD-Net: "Semantic-Conditional Diffusion Networks for Image Captioning", CVPR, 2023 (JD). [Paper][PyTorch]
    • ConZIC: "ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing", CVPR, 2023 (Xidian University). [Paper][PyTorch]
    • SmallCap: "SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation", CVPR, 2023 (University of Lisbon, Portugal). [Paper][PyTorch]
    • LSML: "Crossing the Gap: Domain Generalization for Image Captioning", CVPR, 2023 (USTC). [Paper]
    • MuE: "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model", CVPR, 2023 (NC State). [Paper]
    • OxfordTVG-HIC: "OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?", ICCV, 2023 (Oxford). [Paper][Website]
    • ?: "Guiding Image Captioning Models Toward More Specific Captions", ICCV, 2023 (Google). [Paper]
    • ViECap: "Transferable Decoding with Visual Entities for Zero-Shot Image Captioning", ICCV, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • PMA-Net: "With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning", ICCV, 2023 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper][Code (in construction)]
    • SCORER: "Self-supervised Cross-view Representation Reconstruction for Change Captioning", ICCV, 2023 (CAS). [Paper][Code (in construction)]
    • TSG: "Transforming Visual Scene Graphs to Image Captions", ACL, 2023 (Southeast University, China). [Paper][PyTorch]
    • InfoMetIC: "InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • MultiCapCLIP: "MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning", ACL, 2023 (Peking). [Paper][PyTorch (in construction)]
    • Cur-VL: "Learning from Children: Improving Image-Caption Pretraining via Curriculum", ACL Findings, 2023 (Columbia). [Paper][Code (in construction)]
    • ?: "Text-Only Training for Visual Storytelling", ACMMM, 2023 (USTC). [Paper]
    • CgT-GAN: "CgT-GAN: CLIP-guided Text GAN for Image Captioning", ACMMM, 2023 (USTC). [Paper][PyTorch]
    • Re-ViLM: "Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning", arXiv, 2023 (NVIDIA). [Paper]
    • Knight: "From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • VTT: "Visual Transformation Telling", arXiv, 2023 (CAS). [Paper]
    • Caption-Anything: "Caption Anything: Interactive Image Description with Diverse Multimodal Controls", arXiv, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
    • COLA: "COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?", arXiv, 2023 (Boston). [Paper]
    • ?: "Data Curation for Image Captioning with Text-to-Image Generative Models", arXiv, 2023 (University of Copenhagen, Denmark). [Paper]
    • TLC: "Simple Token-Level Confidence Improves Caption Correctness", arXiv, 2023 (Meta). [Paper]
    • VIVID: "Album Storytelling with Iterative Story-aware Captioning and Large Language Models", arXiv, 2023 (Peking). [Paper]
    • MCDG: "Text-Only Image Captioning with Multi-Context Data Generation", arXiv, 2023 (USTC). [Paper]
    • FuseCap: "FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions", arXiv, 2023 (Israel Institute of Technology). [Paper]
    • StoryGen: "Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)][Website]
    • ?: "Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion", arXiv, 2023 (University of Milano-Bicocca, Italy). [Paper]
    • SITTA: "SITTA: A Semantic Image-Text Alignment for Image Captioning", arXiv, 2023 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • MMNS: "Multimodal Neurons in Pretrained Text-Only Transformers", arXiv, 2023 (MIT). [Paper]
    • RegionBLIP: "RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • ?: "Visually-Aware Context Modeling for News Image Captioning", arXiv, 2023 (KU Leuven). [Paper]
  • Video:
    • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
    • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
    • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
    • PDVC: "End-to-End Dense Video Captioning with Parallel Decoding", ICCV, 2021 (HKU + Southern University of Science and Technology). [Paper][PyTorch]
    • MV-GPT: "End-to-end Generative Pretraining for Multimodal Video Captioning", CVPR, 2022 (Google). [Paper]
    • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
    • UVC-VI: "Aligning Source Visual and Target Language Domains for Unpaired Video Captioning", TPAMI, 2022 (Peking University). [Paper]
    • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tubingen, Germany). [Paper]
    • VCRN: "Visual Commonsense-aware Representation Network for Video Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • RSFD: "Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning", arXiv, 2022 (Wuhan University of Technology). [Paper][Code (in construction)]
    • VLTinT: "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning", AAAI, 2023 (University of Arkansas). [Paper]
    • Vid2Seq: "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning", CVPR, 2023 (Google). [Paper][Website]
    • TextKG: "Text with Knowledge Graph Augmented Transformer for Video Captioning", CVPR, 2023 (ByteDance). [Paper]
    • G2L: "G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory", ICCV, 2023 (Peking). [Paper]
    • CoCap: "Accurate and Fast Compressed Video Captioning", ICCV, 2023 (CAS). [Paper][PyTorch]
    • Movie101: "Movie101: A New Movie Understanding Benchmark", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • VidChapters-7M: "VidChapters-7M: Video Chapters at Scale", NeurIPS (Datasets and Benchmarks), 2023 (INRIA). [Paper][PyTorch][Website]
    • ?: "Implicit and Explicit Commonsense for Multi-sentence Video Captioning", arXiv, 2023 (UBC). [Paper]
    • Video-Verbalization: "A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot", arXiv, 2023 (Adobe). [Paper]
    • Dense-VOC: "Dense Video Object Captioning from Disjoint Supervision", arXiv, 2023 (Google). [Paper]
    • ?: "Exploring the Role of Audio in Video Captioning", arXiv, 2023 (ByteDance). [Paper]
    • ZeroTA: "Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment", arXiv, 2023 (KAIST). [Paper]
    • Video-CSR: "Video-CSR: Complex Video Digest Creation for Visual-Language Models", arXiv, 2023 (ByteDance). [Paper]
  • 3D:
    • Vote2Cap-DETR: "End-to-End 3D Dense Captioning with Vote2Cap-DETR", CVPR, 2023 (Fudan). [Paper][PyTorch]
    • Cap3D: "Scalable 3D Captioning with Pretrained Models", arXiv, 2023 (UMich). [Paper][Dataset]
    • Vote2Cap-DETR++: "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning", arXiv, 2023 (Fudan). [Paper][PyTorch]
  • Others:

[Back to Overview]

Visual Question Answering

  • General:
    • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
    • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
    • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
    • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
    • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
    • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
    • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
    • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
    • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
    • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong). [Paper]
    • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper] [PyTorch]
    • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
    • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
    • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
    • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
    • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
    • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
    • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
    • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
    • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
    • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
    • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
    • ?: "Training Vision-Language Models with Less Bimodal Supervision", Automated Knowledge Base Construction (AKBC), 2022 (Tel Aviv). [Paper]
    • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", NeurIPS, 2022 (Microsoft). [Paper]
    • ScienceQA: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering", NeurIPS, 2022 (AI2). [Paper][PyTorch][Website]
    • FrozenBiLM: "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models", NeurIPS, 2022 (INRIA). [Paper][PyTorch]
    • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
    • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
    • CRIPP-VQA: "CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering", EMNLP, 2022 (Arizona State University). [Paper][PyTorch][Website]
    • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
    • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
    • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
    • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
    • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
    • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
    • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
    • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
    • ?: "Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering", arXiv, 2022 (CAS). [Paper]
    • VLR: "Visually Grounded VQA by Lattice-based Retrieval", arXiv, 2022 (University of Bremen, Germany). [Paper]
    • CMCL: "Cross-Modal Contrastive Learning for Robust Reasoning in VQA", arXiv, 2022 (University of Sydney). [Paper][PyTorch]
    • CL-CrossVQA: "CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering", arXiv, 2022 (LMU Munich). [Paper]
    • OFA-X: "Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations", arXiv, 2022 (University of Hamburg, Germany). [Paper][Code (in construction)]
    • VLC-BERT: "VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge", WACV, 2023 (UBC, Canada). [Paper][PyTorch]
    • LTG: "Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA", AAAI, 2023 (USTC). [Paper]
    • SelTDA: "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!", CVPR, 2023 (NEC). [Paper][PyTorch]
    • Prophet: "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
    • GenB: "Generative Bias for Robust Visual Question Answering", CVPR, 2023 (KAIST). [Paper]
    • MixPHM: "MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering", CVPR, 2023 (Xi'an Jiaotong University). [Paper]
    • POEM: "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", CVPR, 2023 (University of Minnesota (UMN)). [Paper][PyTorch]
    • LYP: "Improving Selective Visual Question Answering by Learning From Your Peers", CVPR, 2023 (Meta). [Paper]
    • VQACL: "VQACL: A Novel Visual Question Answering Continual Learning Setting", CVPR, 2023 (CAS). [Paper][PyTorch]
    • Img2LLM: "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", CVPR, 2023 (Salesforce). [Paper][PyTorch]
    • Imp-VQA: "Logical Implications for Visual Question Answering Consistency", CVPR, 2023 (University of Bern, Switzerland). [Paper][PyTorch][Website]
    • RMLVQA: "RMLVQA: A Margin Loss Approach For Visual Question Answering with Language Biases", CVPR, 2023 (Indian Institute of Science). [Paper][PyTorch]
    • S3C: "S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
    • ?: "Diversifying Joint Vision-Language Tokenization Learning", CVPRW, 2023 (DeepMind). [Paper]
    • VQAAnswerTherapy: "VQA Therapy: Exploring Answer Differences by Visually Grounding Answers", ICCV, 2023 (UT Austin). [Paper][Website]
    • ViTiS: "Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts", ICCVW, 2023 (INRIA). [Paper][Website]
    • TwO: "Combo of Thinking and Observing for Outside-Knowledge VQA", ACL, 2023 (ByteDance). [Paper][Code (in construction)]
    • Mod-Zero-VQA: "Modularized Zero-shot VQA with Pre-trained Models", ACL Findings, 2023 (Singapore Management University). [Paper]
    • SaL: "Separate and Locate: Rethink the Text in Text-based Visual Question Answering", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
    • InfoSeek: "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?", arXiv, 2023 (Google). [Paper][Website]
    • CoVGT: "Contrastive Video Question Answering via Video Graph Transformer", arXiv, 2023 (NUS). [Paper]
    • RVQA: "Toward Unsupervised Realistic Visual Question Answering", arXiv, 2023 (UCSD). [Paper]
    • WHOOPS: "Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images", arXiv, 2023 (Ben Gurion University of the Negev, Israel). [Paper][Website]
    • IVLT: "Causality-aware Visual Scene Discovery for Cross-Modal Question Reasoning", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • MGT: "Multimodal Graph Transformer for Multimodal Question Answering", arXiv, 2023 (UC Santa Cruz). [Paper]
    • VCSR: "Visual Causal Scene Refinement for Video Question Answering", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • SeeTRUE: "What You See is What You Read? Improving Text-Image Alignment Evaluation", arXiv, 2023 (Google). [Paper][PyTorch][Website]
    • JADE: "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner", arXiv, 2023 (CAS). [Paper]
    • NuScenes-QA: "NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario", arXiv, 2023 (Fudan). [Paper][Code (in construction)]
    • LAMOC: "Zero-shot Visual Question Answering with Language Model Feedback", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
    • PW-VQA: "Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA", arXiv, 2023 (University of Rochester). [Paper]
    • Encyclopedic-VQA: "Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories", arXiv, 2023 (Google). [Paper]
    • ?: "Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering", arXiv, 2023 (Mila). [Paper]
    • R2A: "Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models", arXiv, 2023 (CUHK). [Paper]
    • WikiTiLo: "Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning", arXiv, 2023 (LMU Munich). [Paper]
    • GenVQA: "Generative Visual Question Answering", arXiv, 2023 (UW). [Paper]
    • Context-VQA: "Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering", arXiv, 2023 (Stanford). [Paper]
    • BLIVA: "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions", arXiv, 2023 (UCSD). [Paper]
    • NExT-GQA: "Can I Trust Your Answer? Visually Grounded Video Question Answering", arXiv, 2023 (NUS). [Paper]
    • CURE: "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", arXiv, 2023 (SRI). [Paper][Code (in construction)]
    • RepARe: "Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models", arXiv, 2023 (UNC). [Paper][PyTorch]
  • Video:
    • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
    • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
    • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • WildQA: "WildQA: In-the-Wild Video Question Answering", International Conference on Computational Linguistics (COLING), 2022 (UMich). [Paper][Website]
    • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
    • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
    • ViteVQA: "Towards Video Text Visual Question Answering: Benchmark and Baseline", NeurIPS, 2022 (ByteDance). [Paper][GitHub]
    • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
    • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]
    • NewsVideoQA: "Watching the News: Towards VideoQA Models that can Read", arXiv, 2022 (IIIT Hyderabad, India). [Paper]
    • SHG-VQA: "Learning Situation Hyper-Graphs for Video Question Answering", CVPR, 2023 (UCF). [Paper][PyTorch]
    • ANetQA: "ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos", CVPR, 2023 (Hangzhou Dianzi University). [Paper][Website]
    • MCR: "Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering", CVPR, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
    • MIST: "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", CVPR, 2023 (NUS). [Paper][PyTorch]
    • CaKE-LM: "Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering", CVPRW, 2023 (NTU + Columbia). [Paper]
    • TransSTR: "Discovering Spatio-Temporal Rationales for Video Question Answering", ICCV, 2023 (NUS). [Paper]
    • Tem-adapter: "Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer", ICCV, 2023 (CMU). [Paper][Code (in construction)]
    • OVQA: "Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models", ICCV, 2023 (Korea University). [Paper]
    • RaFormer: "Redundancy-aware Transformer for Video Question Answering", ACMMM, 2023 (NUS). [Paper]
    • SeViLA: "Self-Chained Image-Language Model for Video Localization and Question Answering", arXiv, 2023 (UNC). [Paper][PyTorch]
    • FunQA: "FunQA: Towards Surprising Video Comprehension", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)][Website]
  • 3D:
    • 3D-VQA: "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes", CVPRW, 2023 (ETHZ). [Paper][Code (in construction)]
    • Multi-CLIP: "Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes", arXiv, 2023 (ETHZ). [Paper]
  • Audio-Visual:
    • PSTP-Net: "Progressive Spatio-temporal Perception for Audio-Visual Question Answering", ACMMM, 2023 (Renmin University of China). [Paper][PyTorch]

[Back to Overview]

Visual Grounding

  • General:
    • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
    • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
    • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
    • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
    • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
    • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
    • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
    • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
    • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
    • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
    • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
    • UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
    • TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
    • YORO: "YORO - Lightweight End to End Visual Grounding", ECCVW, 2022 (Amazon). [Paper]
    • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
    • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
    • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
    • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
    • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
    • ClipCrop: "ClipCrop: Conditioned Cropping Driven by Vision-Language Model", arXiv, 2022 (The University of Tokyo). [Paper]
    • VL-MPAG-Net: "Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing", WACV, 2023 (Indian Institute of Science). [Paper][PyTorch][Website]
    • CLEVER: "Visually Grounded Commonsense Knowledge Acquisition", AAAI, 2023 (Tsinghua University). [Paper][PyTorch]
    • LADS: "Referring Expression Comprehension Using Language Adaptive Inference", AAAI, 2023 (Zhejiang University). [Paper]
    • ?: "Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models", ICLR, 2023 (Samsung). [Paper]
    • AMC: "Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CounTEX: "Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space", CVPR, 2023 (Amazon). [Paper]
    • SK-VG: "Advancing Visual Grounding with Scene Knowledge: Benchmark and Method", CVPR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • D-ViTMDETR: "Dynamic Inference with Grounding Based Vision and Language Models", CVPR, 2023 (Amazon). [Paper]
    • ?: "Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding", CVPR, 2023 (Tel Aviv). [Paper][Code (in construction)]
    • RefCLIP: "RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension", CVPR, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • FROMAGe: "Grounding Language Models to Images for Multimodal Inputs and Outputs", ICML, 2023 (CMU). [Paper][PyTorch][Website]
    • IR-VG: "Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision", ICCV, 2023 (Beihang). [Paper][Code (in construction)]
    • RefEgo: "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D", ICCV, 2023 (RIKEN). [Paper]
    • CLIP-VG: "CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • TreePrompt: "TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding", arXiv, 2023 (HKUST). [Paper]
    • OctoBERT: "World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models", arXiv, 2023 (UMich). [Paper]
    • BuboGPT: "BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
    • LG-DVG: "Language-Guided Diffusion Model for Visual Grounding", arXiv, 2023 (University of Toronto). [Paper]
    • VGDiffZero: "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders", arXiv, 2023 (Westlake University, China). [Paper]
    • GREC: "GREC: Generalized Referring Expression Comprehension", arXiv, 2023 (NTU, Singapore). [Paper][Website]
  • Video:
    • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
    • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
    • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
    • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
    • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
    • STCAT: "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • VideoWhisperer: "Grounded Video Situation Recognition", NeurIPS, 2022 (IIIT Hyderabad, India). [Paper][Website]
    • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
    • ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]
    • VG-LAW: "Language Adaptive Weight Generation for Multi-task Visual Grounding", CVPR, 2023 (Zhejiang University). [Paper]
    • TCSF: "You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
    • ?: "Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training", CVPR, 2023 (The University of Tokyo). [Paper]
    • DeCo: "DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking", CVPR, 2023 (Toyota). [Paper]
    • HSCNet: "Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • WINNER: "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding", CVPR, 2023 (Zhejiang University). [Paper]
    • IRON: "Iterative Proposal Refinement for Weakly-Supervised Video Grounding", CVPR, 2023 (Microsoft). [Paper]
    • ?: "Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • ProTeGe: "ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding", CVPR, 2023 (Microsoft). [Paper]
    • VidLN: "Connecting Vision and Language with Video Localized Narratives", CVPR, 2023 (Google). [Paper][Website]
    • VDI: "Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training", CVPR, 2023 (Queen Mary University of London). [Paper]
    • UniVTG: "UniVTG: Towards Unified Video-Language Temporal Grounding", ICCV, 2023 (NUS). [Paper][PyTorch]
    • EaTR: "Knowing Where to Focus: Event-aware Transformer for Video Grounding", ICCV, 2023 (Yonsei). [Paper][PyTorch]
    • TSGSV: "Temporal Sentence Grounding in Streaming Videos", ACMMM, 2023 (Shandong University). [Paper]
    • ?: "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos", arXiv, 2023 (Southern University of Science and Technology, China). [Paper]
    • MomentDiff: "MomentDiff: Generative Video Moment Retrieval from Random to Real", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
    • BM-DETR: "Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval", arXiv, 2023 (Seoul National University (SNU)). [Paper][PyTorch (in construction)]
    • ?: "Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models", WACV, 2024 (Queen Mary University of London). [Paper]
  • 3D:
    • ViL3DRel: "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding", NeurIPS, 2022 (INRIA). [Paper][Website]
    • LAR: "Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding", NeurIPS, 2022 (KAUST). [Paper][Website]
    • 3D-CG: "3D Concept Grounding on Neural Fields", NeurIPS, 2022 (MIT). [Paper][Website]
    • NS3D: "NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations", CVPR, 2023 (Stanford). [Paper]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding", CVPR, 2023 (Peking University). [Paper]
    • ?: "Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding", ICCV, 2023 (Zhejiang University). [Paper]
    • Multi3DRefer: "Multi3DRefer: Grounding Text Description to Multiple 3D Objects", ICCV, 2023 (Simon Fraser). [Paper]
    • UniT3D: "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding", ICCV, 2023 (TUM). [Paper]
    • 3DOGSFormer: "Dense Object Grounding in 3D Scenes", ACMMM, 2023 (Peking). [Paper]
    • ViewRefer: "ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ?: "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions", arXiv, 2023 (Columbia). [Paper]
    • 3DRP-Net: "3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding", arXiv, 2023 (Zhejiang University). [Paper]
    • 3DRefTR: "A Unified Framework for 3D Point Cloud Visual Grounding", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
    • CoT3DRef: "CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding", arXiv, 2023 (KAUST). [Paper]

[Back to Overview]

Multi-Modal Representation Learning

  • General:
    • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
    • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
    • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
    • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
    • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
    • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
    • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
    • MERLOT: "MERLOT: Multimodal Neural Script Knowledge Models", NeurIPS, 2021 (UW + AI2). [Paper][Tensorflow][Website]
    • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
    • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
    • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
    • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
    • SimVLM: "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision", ICLR, 2022 (Google). [Paper]
    • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
    • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
    • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • Uni-Perceiver: "Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks", CVPR, 2022 (SenseTime). [Paper][PyTorch]
    • MERLOT-Reserve: "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", CVPR, 2022 (UW + AI2). [Paper][JAX][Website]
    • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
    • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
    • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
    • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
    • OFA: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML, 2022 (Alibaba). [Paper][PyTorch]
    • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
    • Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
    • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
    • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
    • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", NeurIPS, 2022 (SenseTime). [Paper][PyTorch]
    • CLOOB: "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP", NeurIPS, 2022 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", NeurIPS, 2022 (UCLA). [Paper]
    • ?: "Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP", NeurIPS, 2022 (UW). [Paper][Pytorch]
    • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", NeurIPS, 2022 (Tencent). [Paper]
    • ?: "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning", NeurIPS, 2022 (Stanford). [Paper][Website]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", NeurIPS, 2022 (Google). [Paper]
    • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS, 2022 (Microsoft). [Paper][PyTorch (in construction)]
    • Knowledge-CLIP: "Contrastive Language-Image Pre-Training with Knowledge Graphs", NeurIPS, 2022 (Tsinghua). [Paper]
    • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", NeurIPS, 2022 (DeepMind). [Paper]
    • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", NeurIPS, 2022 (Huawei). [Paper][Code (in construction)]
    • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", NeurIPS, 2022 (Google). [Paper]
    • LAION-5B: "LAION-5B: An open large-scale dataset for training next generation image-text models", NeurIPS (Datasets and Benchmarks), 2022 (LAION). [Paper][Website]
    • Wukong: "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark", NeurIPS (Datasets and Benchmarks), 2022 (Huawei). [Paper][Website]
    • TaiSu: "TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training", NeurIPS (Datasets and Benchmarks), 2022 (CAS). [Paper][PyTorch]
    • WinoGAViL: "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models", NeurIPS (Datasets and Benchmarks), 2022 (The Hebrew University of Jerusalem, Israel). [Paper][Website]
    • ELEVATER: "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models", NeurIPS (Datasets and Benchmarks), 2022 (Microsoft). [Paper][Website]
    • ?: "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations", NeurIPS (Datasets and Benchmarks), 2022 (UCF). [Paper][Website]
    • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", TMLR, 2022 (Microsoft). [Paper]
    • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", TMLR, 2022 (Google). [Paper][PyTorch (lucidrains)]
    • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
    • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
    • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
    • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
    • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
    • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
    • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
    • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
    • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
    • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
    • ERNIE: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
    • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
    • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
    • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
    • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
    • CN-CLIP: "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese", arXiv, 2022 (Alibaba). [Paper]
    • CLOSE: "I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data", arXiv, 2022 (AI2). [Paper]
    • X2-VLM: "X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks", arXiv, 2022 (ByteDance). [Paper][Code (in construction)]
    • SkillNet: "One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code", arXiv, 2022 (Tencent). [Paper]
    • Compound-Tokens: "Compound Tokens: Channel Fusion for Vision-Language Representation Learning", arXiv, 2022 (Google). [Paper]
    • WFH: "Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision", WACV, 2023 (Aalto University, Finland). [Paper]
    • Perceiver-VL: "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention", WACV, 2023 (UNC). [Paper][PyTorch]
    • MixGen: "MixGen: A New Multi-Modal Data Augmentation", WACVW, 2023 (Amazon). [Paper]
    • ?: "Unifying Vision-Language Representation Space with Single-tower Transformer", AAAI, 2023 (NAVER). [Paper]
    • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ICLR, 2023 (Google). [Paper]
    • LilT: "Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning", ICLR, 2023 (Northeastern University). [Paper][PyTorch]
    • CLIPs: "Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", ICLR, 2023 (Stanford). [Paper]
    • HiCLIP: "HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention", ICLR, 2023 (Rutgers University). [Paper]
    • DeCap: "DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training", ICLR, 2023 (Zhejiang University). [Paper][PyTorch]
    • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", ICLR, 2023 (Amazon). [Paper]
    • DaVinci: "Write and Paint: Generative Vision-Language Models are Unified Modal Learners", ICLR, 2023 (ByteDance). [Paper][Code (in construction)]
    • EVA: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", CVPR, 2023 (Beijing Academy of Artificial Intelligence (BAAI)). [Paper][PyTorch]
    • FLM: "Accelerating Vision-Language Pretraining with Free Language Modeling", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • VILA: "VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining", CVPR, 2023 (Google). [Paper][JAX]
    • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • ReVeaL: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", CVPR, 2023 (Google). [Paper][Website]
    • SCL: "Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning", CVPR, 2023 (Tencent). [Paper]
    • EPIC: "Leveraging per Image-Token Consistency for Vision-Language Pre-training", CVPR, 2023 (ByteDance). [Paper]
    • PTP: "Position-guided Text Prompt for Vision-Language Pre-training", CVPR, 2023 (Sea AI Lab). [Paper][PyTorch]
    • PHASE: "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias", CVPR, 2023 (Osaka University). [Paper][GitHub]
    • Uni-Perceiver-v2: "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ?: "Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language", CVPR, 2023 (Beijing Institute of Technology). [Paper]
    • GIVL: "GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods", CVPR, 2023 (Amazon). [Paper]
    • FLIP: "Scaling Language-Image Pre-training via Masking", CVPR, 2023 (Meta). [Paper][PyTorch]
    • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", CVPR, 2023 (Tsinghua + Waseda). [Paper][PyTorch]
    • DANCE: "Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles", CVPR, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
    • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", CVPR, 2023 (Microsoft). [Paper]
    • SVLC: "Teaching Structured Vision & Language Concepts to Vision&Language Models", CVPR, 2023 (IBM). [Paper]
    • DeAR: "DeAR: Debiasing Vision-Language Models with Additive Residuals", CVPR, 2023 (Adobe). [Paper][GitHub]
    • ?: "Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning", CVPR, 2023 (Amazon). [Paper]
    • ?: "Joint Adaptive Representations for Image-Language Learning", CVPRW, 2023 (DeepMind). [Paper]
    • BLIP-2: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", ICML, 2023 (Salesforce). [Paper][PyTorch]
    • RLEG: "RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation", ICML, 2023 (Alibaba). [Paper]
    • Mod-X: "Continual Vision-Language Representation Learning with Off-Diagonal Information", ICML, 2023 (Huawei). [Paper]
    • ILLUME: "ILLUME: Rationalizing Vision-Language Models through Human Interactions", ICML, 2023 (German Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
    • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", ICML, 2023 (Google). [Paper]
    • MERU: "Hyperbolic Image-Text Representations", ICML, 2023 (Meta). [Paper]
    • ?: "Measuring Progress in Fine-grained Vision-and-Language Understanding", ACL, 2023 (DeepMind). [Paper]
    • RELIT: "Weakly Supervised Vision-and-Language Pre-training with Relative Representations", ACL, 2023 (Tsinghua). [Paper]
    • PuMer: "PuMer: Pruning and Merging Tokens for Efficient Vision Language Models", ACL, 2023 (UW). [Paper]
    • SINC: "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks", ICCV, 2023 (Microsoft). [Paper]
    • ALIP: "ALIP: Adaptive Language-Image Pre-training with Synthetic Caption", ICCV, 2023 (DeepGlint, China). [Paper][PyTorch]
    • SigLiT: "Sigmoid Loss for Language Image Pre-Training", ICCV, 2023 (Google). [Paper]
    • VL-PET: "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • GrowCLIP: "GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • ViLLA: "ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data", ICCV, 2023 (Stanford). [Paper][PyTorch]
    • CFM-ViT: "Contrastive Feature Masking Open-Vocabulary Vision Transformer", ICCV, 2023 (DeepMind). [Paper]
    • OPTIMA: "Module-wise Adaptive Distillation for Multimodality Foundation Models", NeurIPS, 2023 (Google). [Paper]
    • KOSMOS-1: "Language Is Not All You Need: Aligning Perception with Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
    • Prismer: "Prismer: A Vision-Language Model with An Ensemble of Experts", arXiv, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • RVLM: "Replacement as a Self-supervision for Fine-grained Vision-language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • MuLTI: "MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling", arXiv, 2023 (Alibaba). [Paper]
    • VL-MoE: "Scaling Vision-Language Models with Sparse Mixture of Experts", arXiv, 2023 (Berkeley + Microsoft). [Paper]
    • EVA-02: "EVA-02: A Visual Representation for Neon Genesis", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • CoBIT: "CoBIT: A Contrastive Bi-directional Image-Text Generation Model", arXiv, 2023 (Google). [Paper]
    • EqSim: "Equivariant Similarity for Vision-Language Foundation Models", arXiv, 2023 (Microsoft). [Paper][PyTorch]
    • EVA-CLIP: "EVA-CLIP: Improved Training Techniques for CLIP at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • MaMMUT: "MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks", arXiv, 2023 (Google). [Paper]
    • CAVL: "CAVL: Learning Contrastive and Adaptive Representations of Vision and Language", arXiv, 2023 (CMU). [Paper]
    • MoMo: "MoMo: A shared encoder Model for text, image and multi-Modal representations", arXiv, 2023 (Amazon). [Paper]
    • REAVL: "Retrieval-based Knowledge Augmented Vision Language Pre-training", arXiv, 2023 (Tencent). [Paper]
    • ALBEF-MI: "Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation", arXiv, 2023 (Alibaba). [Paper]
    • Helip: "Boosting Visual-Language Models by Exploiting Hard Samples", arXiv, 2023 (Huawei). [Paper]
    • IMP: "Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception", arXiv, 2023 (Google). [Paper]
    • Musketeer: "Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts", arXiv, 2023 (Amazon). [Paper]
    • GVT: "What Makes for Good Visual Tokenizers for Large Language Models?", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • S-CLIP: "S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions", arXiv, 2023 (KAIST). [Paper]
    • VisorGPT: "VisorGPT: Learning Visual Prior via Generative Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • IdealGPT: "IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models", arXiv, 2023 (Columbia University). [Paper][PyTorch]
    • PaLI-X: "PaLI-X: On Scaling up a Multilingual Vision and Language Model", arXiv, 2023 (Google). [Paper]
    • CrossGET: "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
    • TL;DR: "Too Large; Data Reduction for Vision-Language Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)]
    • DiffusionITM: "Are Diffusion Models Vision-And-Language Reasoners?", arXiv, 2023 (Mila). [Paper]
    • COSA: "COSA: Concatenated Sample Pretrained Vision-Language Foundation Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
    • Babel-ImageNet: "Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
    • Kosmos-2: "Kosmos-2: Grounding Multimodal Large Language Models to the World", arXiv, 2023 (Microsoft). [Paper][PyTorch][Demo]
    • LENS: "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language", arXiv, 2023 (Contextual AI + Stanford). [Paper][PyTorch][Demo]
    • OBELISC: "OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents", arXiv, 2023 (Hugging Face). [Paper][GitHub]
    • Emu: "Generative Pretraining in Multimodality", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • mBLIP: "mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
    • P-Former: "Bootstrapping Vision-Language Learning with Decoupled Language Pre-training", arXiv, 2023 (Dartmouth College). [Paper]
    • SEED-OPT: "Planting a SEED of Vision in Large Language Model", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • OpenFlamingo: "OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models", arXiv, 2023 (UW). [Paper][PyTorch]
    • Free-ATM: "Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks", arXiv, 2023 (ByteDance). [Paper]
    • LCL: "Link-Context Learning for Multimodal LLMs", arXiv, 2023 (SenseTime). [Paper]
    • DLIP: "DLIP: Distilling Language-Image Pre-training", arXiv, 2023 (ByteDance). [Paper]
    • ViLTA: "ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation", arXiv, 2023 (Tsinghua). [Paper]
    • DAS: "Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
    • LaVIT: "Unified Language-Vision Pretraining with Dynamic Discrete Visual Tokenization", arXiv, 2023 (Kuaishou). [Paper][Code (in construction)]
    • MMICL: "MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning", arXiv, 2023 (Peking). [Paper][PyTorch]
    • ELIP: "ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens", arXiv, 2023 (NUS). [Paper]
    • SEED-LLaMA: "Making LLaMA SEE and Draw with SEED Tokenizer", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • ITIT: "Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency", arXiv, 2023 (Google). [Paper]
    • SimVLG: "SimVLG: Simple and Efficient Pretraining of Visual Language Generative Models", arXiv, 2023 (ByteDance). [Paper]
    • VeCLIP: "From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions", arXiv, 2023 (Apple). [Paper]
    • PaLI-3: "PaLI-3 Vision Language Models: Smaller, Faster, Stronger", arXiv, 2023 (Google). [Paper]
    • COMM: "From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models", arXiv, 2023 (Huawei). [Paper][PyTorch (in construction)]
  • Video:
    • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
    • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper][PyTorch]
    • VideoCLIP: "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP, 2021 (Facebook). [Paper][PyTorch]
    • VALUE: "VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation", NeurIPS (Datasets and Benchmarks), 2021 (Microsoft). [Paper][Website]
    • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
    • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
    • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
    • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
    • CLOP: "CLOP: Video-and-Language Pre-Training with Knowledge Regularizations", ACMMM, 2022 (Baidu). [Paper]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
    • EMCL: "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • LF-VILA: "Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning", NeurIPS, 2022 (Microsoft). [Paper][GitHub]
    • VATT-GR-CL: "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization", NeurIPS, 2022 (Google). [Paper]
    • LGDN: "LGDN: Language-Guided Denoising Network for Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
    • EgoVLP: "Egocentric Video-Language Pretraining", NeurIPS, 2022 (NUS). [Paper][PyTorch]
    • LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
    • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
    • VIOLET: "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • SimVTP: "SimVTP: Simple Video Text Pre-training with Masked Autoencoders", arXiv, 2022 (Tencent). [Paper][PyTorch (in construction)]
    • VideoCoCa: "Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners", arXiv, 2022 (Google). [Paper]
    • i-Code: "i-Code: An Integrative and Composable Multimodal Learning Framework", AAAI, 2023 (Microsoft). [Paper][Code (in construction)]
    • TempCLR: "TempCLR: Temporal Alignment Representation with Contrastive Learning", ICLR, 2023 (Columbia). [Paper]
    • MELTR: "MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models", CVPR, 2023 (Korea University). [Paper][PyTorch]
    • VIOLETv2: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • SViTT: "SViTT: Temporal Learning of Sparse Video-Text Transformers", CVPR, 2023 (Intel). [Paper][Website]
    • TVTS: "Learning Transferable Spatiotemporal Representations from Natural Script Knowledge", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • HBI: "Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning", CVPR, 2023 (Peking University). [Paper][Code (in construction)][Website]
    • All-in-One: "All in One: Exploring Unified Video-Language Pre-training", CVPR, 2023 (NUS). [Paper][PyTorch]
    • VindLU: "VindLU: A Recipe for Effective Video-and-Language Pretraining", CVPR, 2023 (UNC). [Paper][PyTorch]
    • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", CVPR, 2023 (ByteDance). [Paper][PyTorch (in construction)]
    • mPLUG-2: "mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video", ICML, 2023 (Alibaba). [Paper][Code (in construction)]
    • BUS: "BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization", ICCV, 2023 (Alibaba). [Paper]
    • UMT: "Unmasked Teacher: Towards Training-Efficient Video Foundation Models", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ?: "Long-range Multimodal Pretraining for Movie Understanding", ICCV, 2023 (Adobe). [Paper]
    • EgoVLPv2: "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone", ICCV, 2023 (Meta). [Paper][Website]
    • STOA-VLP: "STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • G-ViLM: "Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding", arXiv, 2023 (Google). [Paper]
    • VLAB: "VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending", arXiv, 2023 (ByteDance). [Paper]
    • i-Code-V2: "i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)]
    • TVTSv2: "TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • VFC: "Verbs in Action: Improving verb understanding in video-language models", arXiv, 2023 (Google). [Paper]
    • Youku-mPLUG: "Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks", arXiv, 2023 (Alibaba). [Paper]
    • VideoGLUE: "VideoGLUE: Video General Understanding Evaluation of Foundation Models", arXiv, 2023 (Google). [Paper]
    • InternVid: "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • EVE: "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • Qwen-VL: "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • BT-Adapter: "One For All: Video Conversation is Feasible Without Video Instruction Tuning", arXiv, 2023 (Tencent). [Paper]
  • 3D:
    • CLIP2: "CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data", CVPR, 2023 (Huawei). [Paper]
    • 3D-VLP: "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training", CVPR, 2023 (Sichuan University). [Paper][PyTorch]
    • SDFusion: "SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation", CVPR, 2023 (Snap). [Paper][PyTorch][Website]
    • 3D-VisTA: "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment", ICCV, 2023 (Beijing Institute for General Artificial Intelligence (BIGAI)). [Paper][PyTorch][Website]
    • RegionPLC: "RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding", arXiv, 2023 (HKU). [Paper][Website]
    • 3DVLP: "Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding", arXiv, 2023 (Tsinghua). [Paper]
    • CLIPXPlore: "CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration", arXiv, 2023 (CUHK). [Paper]
    • Point-PEFT: "Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • Vision-Audio-Text:
    • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
    • VideoCC: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper][Website]
    • MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
    • VATLM: "VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • CLIP4VLA: "Accommodating Audio Modality in CLIP for Multimodal Processing", AAAI, 2023 (Renmin University of China). [Paper]
    • data2vec-2.0: "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language", ICML, 2023 (Meta). [Paper][PyTorch]
    • VALOR: "VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset", arXiv, 2023 (CAS). [Paper][PyTorch][Website]
    • VAST: "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset", arXiv, 2023 (CAS). [Paper]
  • More than 3 modalities:
    • Meta-Transformer: "Meta-Transformer: A Unified Framework for Multimodal Learning", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • UnIVAL: "Unified Model for Image, Video, Audio and Language Tasks", arXiv, 2023 (Sorbonne University, France). [Paper][PyTorch][Website]
    • ViT-Lens: "ViT-Lens: Towards Omni-modal Representations", arXiv, 2023 (Tencent). [Paper][PyTorch]
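
For orientation, the snippet below is a minimal, self-contained sketch of the pairwise sigmoid loss studied in the "Sigmoid Loss for Language Image Pre-Training" entry (SigLiT) above. The variable names, the toy batch, and the temperature/bias values are illustrative assumptions rather than the authors' reference implementation.

```python
# Minimal sketch of a pairwise sigmoid image-text loss (SigLiT/SigLIP-style).
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img_emb, txt_emb, log_t, b):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of N matching pairs.
    log_t, b: learnable scalar log-temperature and bias."""
    logits = img_emb @ txt_emb.T * log_t.exp() + b                    # (N, N) pairwise scores
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1  # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# Toy usage with random features standing in for encoder outputs.
N, D = 8, 512
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(N, D), dim=-1)
log_t = torch.tensor(2.3, requires_grad=True)   # roughly log(10); illustrative initialization
b = torch.tensor(-10.0, requires_grad=True)     # illustrative bias initialization
loss = sigmoid_contrastive_loss(img, txt, log_t, b)
loss.backward()
```

Unlike the softmax-based InfoNCE objective used by CLIP-style entries, every image-text pair is scored independently here, which is what decouples the loss from the global batch size.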

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
    • ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
    • MACK: "MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching", NeurIPS, 2022 (CAS). [Paper]
    • MLA: "Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval", NeurIPS, 2022 (Renmin University of China). [Paper]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
    • VLPCook: "Structured Vision-Language Pretraining for Computational Cooking", arXiv, 2022 (Sorbonne University, France). [Paper]
    • UniVL-DR: "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval", ICLR, 2023 (Northeastern University, China). [Paper]
    • HREM: "Learning Semantic Relationship Among Instances for Image-Text Matching", CVPR, 2023 (USTC). [Paper]
    • CHAN: "Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • ViLEM: "ViLEM: Visual-Language Error Modeling for Image-Text Retrieval", CVPR, 2023 (CAS). [Paper]
    • SoftMask: "Multi-Modal Representation Learning with Text-Driven Soft Masks", CVPR, 2023 (SNU). [Paper]
    • MetaPer: "Meta-Personalizing Vision-Language Models To Find Named Instances in Video", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • DivE: "Improving Cross-Modal Retrieval with Set of Diverse Embeddings", CVPR, 2023 (POSTECH). [Paper][Website]
    • Pic2Word: "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval", CVPR, 2023 (Google). [Paper][PyTorch]
    • ConaCLIP: "ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval", ACL Industry Track, 2023 (Alibaba). [Paper][PyTorch]
    • FNE: "Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • HAT: "Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • STAIR: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", arXiv, 2023 (Apple). [Paper]
    • ChatIR: "Chatting Makes Perfect - Chat-based Image Retrieval", arXiv, 2023 (The Hebrew University of Jerusalem, Israel). [Paper]
    • TransAgg: "Zero-shot Composed Text-Image Retrieval", arXiv, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • Frozen: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper][PyTorch][Website][Dataset]
    • CLIP4Clip: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • OA-Trans: "Object-aware Video-language Pre-training for Retrieval", CVPR, 2022 (NUS). [Paper][PyTorch]
    • BridgeFormer: "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper][PyTorch]
    • VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
    • LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
    • ?: "A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]
    • Cross-Modal-Adapter: "Cross-Modal Adapter for Text-Video Retrieval", arXiv, 2022 (Tsinghua University). [Paper][PyTorch (in construction)]
    • MAC: "Masked Contrastive Pre-Training for Efficient Video-Text Retrieval", arXiv, 2022 (Alibaba). [Paper]
    • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", ICLR, 2023 (Microsoft). [Paper][Code (in construction)]
    • HiREST: "Hierarchical Video-Moment Retrieval and Step-Captioning", CVPR, 2023 (UNC + Meta). [Paper][PyTorch][Website]
    • Cap4Video: "Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • CLIPPING: "CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval", CVPR, 2023 (Huawei). [Paper]
    • CNVid-3.5M: "CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
    • CelebV-Text: "CelebV-Text: A Large-Scale Facial Text-Video Dataset", CVPR, 2023 (University of Sydney). [Paper][GitHub][Website]
    • ReST: "Relational Space-Time Query in Long-Form Videos", CVPR, 2023 (Meta). [Paper]
    • NaQ: "NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory", CVPR, 2023 (UT Austin). [Paper][PyTorch][Website]
    • ?: "Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval", CVPR, 2023 (Columbia). [Paper][Code (in contruction)]
    • VoP: "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval", CVPR, 2023 (Alibaba). [Paper][Code (in construction)][Website]
    • SpotEM: "SpotEM: Efficient Video Search for Episodic Memory", ICML, 2023 (UT Austin). [Paper][Website]
    • PromptSwitch: "Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval", ICCV, 2023 (University of Adelaide). [Paper][PyTorch (in construction)]
    • ?: "Simple Baselines for Interactive Video Retrieval with Questions and Answers", ICCV, 2023 (Princeton). [Paper][Code (in construction)]
    • MeVTR: "Multi-event Video-Text Retrieval", ICCV, 2023 (LMU Munich). [Paper][Code (in construction)]
    • In-Style: "In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval", ICCV, 2023 (MPI). [Paper][Code (in construction)]
    • ReGaDa: "Video-adverb retrieval with compositional adverb-action embeddings", BMVC, 2023 (University of Tübingen, Germany). [Paper][Code (in construction)][Website]
    • DiffusionRet: "DiffusionRet: Generative Text-Video Retrieval with Diffusion Model", arXiv, 2023 (Peking University). [Paper]
    • TextVR: "A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • MASCOT: "Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval", arXiv, 2023 (?). [Paper]
    • CrossTVR: "Fine-grained Text-Video Retrieval with Frozen Image Encoders", arXiv, 2023 (Alibaba). [Paper]
    • TEFAL: "Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment", arXiv, 2023 (Amazon). [Paper]
    • TeachCLIP: "TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval", arXiv, 2023 (Renmin University of China). [Paper]
    • CoVR: "CoVR: Learning Composed Video Retrieval from Web Video Captions", arXiv, 2023 (Ecole des Ponts ParisTech (ENPC), France). [Paper][PyTorch][Website]
    • LanguageBind: "LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment", arXiv, 2023 (Peking). [Paper][PyTorch]
  • Vision-Audio-Text:
  • Others:
    • IRRA: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
    • ZS-SBIR: "CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not", CVPR, 2023 (University of Surrey, UK). [Paper][PyTorch]
    • ViML: "Language-Guided Music Recommendation for Video via Prompt Analogies", CVPR, 2023 (Adobe). [Paper][Website]
    • Auto-ACD: "A Large-scale Dataset for Audio-Language Representation Learning", arXiv, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)][Website]
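
As a rough illustration of how the CLIP-based text-video retrieval entries above (e.g., CLIP4Clip) score candidates, the sketch below mean-pools frozen per-frame embeddings into a single video embedding and ranks videos by cosine similarity. The mean pooling and the random stand-in features are assumptions; the listed papers study many stronger temporal aggregation and token-selection schemes.

```python
# Minimal sketch of text-to-video retrieval with mean-pooled frame features.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_videos(text_emb, frame_embs):
    """text_emb: (Q, D) query embeddings; frame_embs: (V, T, D) per-frame embeddings."""
    video_emb = F.normalize(frame_embs.mean(dim=1), dim=-1)  # (V, D) mean-pool over time
    text_emb = F.normalize(text_emb, dim=-1)
    sims = text_emb @ video_emb.T                            # (Q, V) cosine similarities
    return sims.argsort(dim=-1, descending=True)             # per-query ranking of video indices

# Toy usage: random tensors stand in for frozen CLIP text/frame features.
ranking = rank_videos(torch.randn(2, 512), torch.randn(10, 8, 512))
```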

[Back to Overview]

Multi-Modal Generation

  • General:
    • AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
    • ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • LDM: "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR, 2022 (LMU Munich). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
    • TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • CLIPDraw: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders", NeurIPS, 2022 (Cross Compass, Japan). [Paper][PyTorch][Blog]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", NeurIPS, 2022 (Google). [Paper][Website]
    • ?: "Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark", NeurIPSW, 2022 (Boston + MIT + Columbia). [Paper]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
    • Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
    • UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
    • VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
    • Lafite2: "Lafite2: Few-shot Text-to-Image Generation", arXiv, 2022 (SUNY, Buffalo). [Paper]
    • eDiffi: "eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers", arXiv, 2022 (NVIDIA). [Paper][Website]
    • SpaText: "SpaText: Spatio-Textual Representation for Controllable Image Generation", arXiv, 2022 (Meta). [Paper][Website]
    • Story-LDM: "Make-A-Story: Visual Memory Conditioned Consistent Story Generation", arXiv, 2022 (UBC + Snap). [Paper]
    • Structure-Diffusion: "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis", arXiv, 2022 (UCSB + UC Santa Cruz). [Paper][PyTorch][Website]
    • Re-Imagen: "Re-Imagen: Retrieval-Augmented Text-to-Image Generator", ICLR, 2023 (Google). [Paper]
    • Prompt-to-Prompt: "Prompt-to-Prompt Image Editing with Cross Attention Control", ICLR, 2023 (Google). [Paper][PyTorch][Website]
    • UniD3: "Unified Discrete Diffusion for Simultaneous Vision-Language Generation", ICLR, 2023 (NTU, Singapore). [Paper]
    • T2P: "Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation", CVPR, 2023 (Fuxi AI Lab). [Paper]
    • GLIGEN: "GLIGEN: Open-Set Grounded Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • MAGVLT: "MAGVLT: Masked Generative Vision-and-Language Transformer", CVPR, 2023 (Kakao). [Paper]
    • ReCo: "ReCo: Region-Controlled Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • GALIP: "GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis", CVPR, 2023 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch]
    • DreamBooth: "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR, 2023 (Google). [Paper][GitHub][Website]
    • RIATIG: "RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts", CVPR, 2023 (Washington University in St. Louis). [Paper]
    • ERNIE-ViLG-2.0: "ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts", CVPR, 2023 (Baidu). [Paper][Website]
    • GigaGAN: "Scaling up GANs for Text-to-Image Synthesis", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • Shifted-Diffusion: "Shifted Diffusion for Text-to-image Generation", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • Specialist-Diffusion: "Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style", CVPR, 2023 (Picsart). [Paper][Website]
    • ?: "Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation", CVPR, 2023 (CyberAgent, Japan). [Paper]
    • Custom-Diffusion: "Multi-Concept Customization of Text-to-Image Diffusion", CVPR, 2023 (Adobe). [Paper]
    • UniDiffuser: "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale", ICML, 2023 (Tsinghua University). [Paper][PyTorch]
    • Muse: "Muse: Text-To-Image Generation via Masked Generative Transformers", ICML, 2023 (Google). [Paper][Website]
    • RA-CM3: "Retrieval-Augmented Multimodal Language Modeling", ICML, 2023 (Meta). [Paper]
    • StyleGAN-T: "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • VD: "Versatile Diffusion: Text, Images and Variations All in One Diffusion Model", ICCV, 2023 (Oregon). [Paper][PyTorch]
    • DiT: "Scalable Diffusion Models with Transformers", ICCV, 2023 (Meta). [Paper][PyTorch][Website]
    • E4T: "Designing an Encoder for Fast Personalization of Text-to-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • ?: "Controlled and Conditional Text to Image Generation with Diffusion Prior", arXiv, 2023 (Adobe). [Paper]
    • Lformer: "Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding", arXiv, 2023 (Zhejiang University). [Paper]
    • UMM-Diffusion: "Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation", arXiv, 2023 (Peking University). [Paper]
    • TIFA: "TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering", arXiv, 2023 (UW). [Paper][Code (in construction)][Website]
    • ToMESD: "Token Merging for Fast Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
    • layout-guidance: "Training-Free Layout Control with Cross-Attention Guidance", arXiv, 2023 (Oxford). [Paper][PyTorch][Website]
    • HRS-Bench: "HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models", arXiv, 2023 (KAUST). [Paper][GitHub][Website]
    • SeedSelect: "It is all about where you start: Text-to-image generation with seed selection", arXiv, 2023 (Bar-Ilan University, Israel). [Paper]
    • DisenBooth: "DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation", arXiv, 2023 (Tsinghua). [Paper]
    • VideoOFA: "VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation", arXiv, 2023 (Meta). [Paper]
    • FastComposer: "FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
    • LLMScore: "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation", arXiv, 2023 (UCSB). [Paper][PyTorch]
    • CoDi: "Any-to-Any Generation via Composable Diffusion", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • ?: "The CLIP Model is Secretly an Image-to-Prompt Converter", arXiv, 2023 (Xidian University). [Paper]
    • PoS-subspaces: "Parts of Speech-Grounded Subspaces in Vision-Language Models", arXiv, 2023 (Queen Mary University of London). [Paper][PyTorch (in construction)][Website]
    • VPGen: "Visual Programming for Text-to-Image Generation and Evaluation", arXiv, 2023 (UNC). [Paper][PyTorch][Website]
    • BLIP-Diffusion: "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing", arXiv, 2023 (Salesforce). [Paper][Code (in construction)][Website]
    • SeeCoder: "Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models", arXiv, 2023 (Picsart). [Paper][PyTorch]
    • GILL: "Generating Images with Multimodal Language Models", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • CAC: "Localized Text-to-Image Generation for Free via Cross Attention Control", arXiv, 2023 (CMU). [Paper]
    • CLIPAG: "CLIPAG: Towards Generator-Free Text-to-Image Generation", arXiv, 2023 (Technion, Israel). [Paper]
    • PACGen: "Generate Anything Anywhere in Any Scene", arXiv, 2023 (UW Madison). [Paper][Code (in construction)][Website]
    • SPAE: "SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs", arXiv, 2023 (Google). [Paper]
    • DA-Score: "Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback", arXiv, 2023 (ANU). [Paper][Code (in construction)][Website]
    • HyperDreamBooth: "HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • GORS: "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation", arXiv, 2023 (HKU). [Paper][Website][PyTorch]
    • IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models", arXiv, 2023 (Tencent). [Paper][Website]
    • ORES: "ORES: Open-vocabulary Responsible Visual Synthesis", arXiv, 2023 (Microsoft). [Paper]
    • CM3Leon: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", arXiv, 2023 (Meta). [Paper]
    • DreamLLM: "DreamLLM: Synergistic Multimodal Comprehension and Creation", arXiv, 2023 (Megvii). [Paper][Code (in construction)][Website]
    • FreeU: "FreeU: Free Lunch in Diffusion U-Net", arXiv, 2023 (NTU, Singapore). [Paper][Website][Code (in construction)]
    • Emu: "Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack", arXiv, 2023 (Meta). [Paper]
    • PixArt-α: "PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis", arXiv, 2023 (Huawei). [Paper][Website]
    • Kosmos-G: "Kosmos-G: Generating Images in Context with Multimodal Large Language Models", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • AlignProp: "Aligning Text-to-Image Diffusion Models with Reward Backpropagation", arXiv, 2023 (CMU). [Paper][PyTorch][Website]
    • Idea2Img: "Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation", arXiv, 2023 (Microsoft). [Paper][Website]
    • EasyGen: "Making Multimodal Generation Easier: When Diffusion Models Meet LLMs", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • Video:
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
    • ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
    • MagicVideo: "MagicVideo: Efficient Video Generation With Latent Diffusion Models", arXiv, 2022 (ByteDance). [Paper][Website]
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", ICLR, 2023 (Tsinghua University). [Paper][GitHub (in construction)]
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", ICLR, 2023 (Meta). [Paper]
    • VideoLDM: "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][Website]
    • MMVG: "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation", CVPR, 2023 (Meta). [Paper]
    • MM-Diffusion: "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • PYoCo: "Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models", ICCV, 2023 (NVIDIA). [Paper][Website]
    • Text2Video-Zero: "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators", ICCV, 2023 (Picsart). [Paper][Code (in construction)]
    • Text2Performer: "Text2Performer: Text-Driven Human Video Generation", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
    • VideoFactory: "VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation", arXiv, 2023 (Microsoft). [Paper]
    • Video-Adapter: "Probabilistic Adaptation of Text-to-Video Models", arXiv, 2023 (DeepMind). [Paper][Website]
    • SimDA: "SimDA: Simple Diffusion Adapter for Efficient Video Generation", arXiv, 2023 (Fudan). [Paper][Website]
    • LVD: "LLM-grounded Video Diffusion Models", arXiv, 2023 (Berkeley). [Paper][Code (in construction)][Website]
  • 3D:
    • Magic3D: "Magic3D: High-Resolution Text-to-3D Content Creation", CVPR, 2023 (NVIDIA). [Paper][Website]
    • CLIP-Sculptor: "CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language", CVPR, 2023 (Autodesk). [Paper][Website]
    • Diffusion-SDF: "Diffusion-SDF: Text-to-Shape via Voxelized Diffusion", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • TAPS3D: "TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision", CVPR, 2023 (Bytedance). [Paper][PyTorch]
    • Dream3D: "Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models", CVPR, 2023 (Tencent). [Paper][Website]
    • InstructP2P: "InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions", arXiv, 2023 (Tencent). [Paper]
    • ATT3D: "ATT3D: Amortized Text-to-3D Object Synthesis", arXiv, 2023 (NVIDIA). [Paper][Website]
    • SDS-Complete: "Point-Cloud Completion with Pretrained Text-to-image Diffusion Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • Michelangelo: "Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation", arXiv, 2023 (Tencent). [Paper][Code (in construction)][Website]
    • DiffTF: "Large-Vocabulary 3D Diffusion Model with Transformer", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
  • Others:
    • DiffGesture: "Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation", CVPR, 2023 (HKU). [Paper][PyTorch]
    • CondFoleyGen: "Conditional Generation of Audio from Video via Foley Analogies", CVPR, 2023 (UMich). [Paper][PyTorch (in construction)][Website]
    • Physics-Diffusion: "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos", CVPR, 2023 (IBM). [Paper][PyTorch][Website]
    • RACER: "Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards", CVPR, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
    • ReVISE: "ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • MAV3D: "Text-To-4D Dynamic Scene Generation", ICML, 2023 (Meta). [Paper][Website]
    • LORIS: "Long-Term Rhythmic Video Soundtracker", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • NExT-GPT: "NExT-GPT: Any-to-Any Multimodal LLM", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
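
Many of the text-to-image diffusion entries above (e.g., GLIDE, Imagen, and the LDM-based models) sample with classifier-free guidance. The sketch below shows only that blending step; the dummy denoiser and the guidance scale are illustrative assumptions, not any specific paper's settings.

```python
# Minimal sketch of classifier-free guidance for a text-conditioned diffusion sampler.
import torch

def cfg_noise_prediction(denoiser, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Blend conditional and unconditional noise predictions:
    eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = denoiser(x_t, t, cond_emb)      # prediction given the text prompt
    eps_uncond = denoiser(x_t, t, uncond_emb)  # prediction given an empty/null prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a dummy denoiser that ignores its inputs.
dummy_denoiser = lambda x, t, c: torch.zeros_like(x)
eps = cfg_noise_prediction(dummy_denoiser,
                           torch.randn(1, 4, 64, 64),   # noisy latent
                           torch.tensor([10]),          # timestep
                           torch.randn(1, 77, 768),     # text-conditioning embedding
                           torch.randn(1, 77, 768))     # null-prompt embedding
```

Larger guidance scales generally trade sample diversity for prompt fidelity.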

[Back to Overview]

Prompt Learning/Tuning

  • CLIP-Adapter: "CLIP-Adapter: Better Vision-Language Models with Feature Adapters", arXiv, 2021 (Shanghai AI Lab). [Paper][PyTorch]
  • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
  • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
  • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
  • PerVL: ""This is my unicorn, Fluffy": Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
  • OrdinalCLIP: "OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
  • BeamCLIP: "Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching", NeurIPS, 2022 (LG). [Paper]
  • TPT: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch][Website]
  • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
  • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", CVPR, 2023 (Samsung). [Paper][Website]
  • VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
  • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
  • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
  • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
  • CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
  • PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
  • MVLPT: "Multitask Vision-Language Prompt Tuning", arXiv, 2022 (Berkeley). [Paper][PyTorch]
  • ?: "Task Bias in Vision-Language Models", arXiv, 2022 (Columbia). [Paper]
  • UPL: "Unsupervised Prompt Learning for Vision-Language Models", arXiv, 2022 (Peking). [Paper][PyTorch]
  • DeFo: "Learning to Decompose Visual Features with Latent Textual Prompts", ICLR, 2023 (UIUC). [Paper]
  • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", ICLR, 2023 (CMU). [Paper]
  • ?: "Visual Classification via Description from Large Language Models", ICLR, 2023 (Columbia). [Paper]
  • CSP: "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning", ICLR, 2023 (Brown University). [Paper][PyTorch]
  • CaFo: "Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • ?: "Multimodal Prompting with Missing Modalities for Visual Recognition", CVPR, 2023 (NYCU). [Paper][PyTorch][Website]
  • DAM-VP: "Diversity-Aware Meta Visual Prompting", CVPR, 2023 (USTC). [Paper][PyTorch]
  • ILM-VP: "Understanding and Improving Visual Prompting: A Label-Mapping Perspective", CVPR, 2023 (Michigan State). [Paper][PyTorch]
  • KgCoOp: "Visual-Language Prompt Tuning with Knowledge-guided Context Optimization", CVPR, 2023 (CAS). [Paper][PyTorch]
  • BlackVIP: "BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning", CVPR, 2023 (University of Seoul). [Paper][PyTorch]
  • EXPRES: "Learning Expressive Prompting With Residuals for Vision Transformers", CVPR, 2023 (Amazon). [Paper]
  • ?: "Learning to Name Classes for Vision and Language Models", CVPR, 2023 (Huawei). [Paper]
  • PMF: "Efficient Multimodal Fusion via Interactive Prompting", CVPR, 2023 (Zhejiang University). [Paper]
  • MaPLe: "MaPLe: Multi-modal Prompt Learning", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
  • HiPro: "Hierarchical Prompt Learning for Multi-Task Learning", CVPR, 2023 (JD). [Paper]
  • DFSP: "Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TaI-DP: "Texts as Images in Prompt Tuning for Multi-Label Image Recognition", CVPR, 2023 (Tomorrow Advancing Life (TAL)). [Paper][PyTorch]
  • ESPER: "Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning", CVPR, 2023 (Yonsei). [Paper][PyTorch]
  • APT: "A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting", CVPR, 2023 (Amazon). [Paper]
  • VQT: "Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning", CVPR, 2023 (The Ohio State University (OSU)). [Paper]
  • LaBo: "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification", CVPR, 2023 (University of Pennsylvania). [Paper][PyTorch]
  • TaskRes: "Task Residual for Tuning Vision-Language Models", CVPR, 2023 (NUS). [Paper][PyTorch]
  • POUF: "POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models", ICML, 2023 (UT Austin). [Paper][PyTorch]
  • ?: "Improving Visual Prompt Tuning for Self-supervised Vision Transformers", ICML, 2023 (SNU). [Paper][PyTorch]
  • ZPE: "A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models", ICML, 2023 (Google). [Paper]
  • CMPA: "Deeply Coupled Cross-Modal Prompt Learning", ACL Findings, 2023 (SenseTime). [Paper]
  • PromptSRC: "Self-regulating Prompts: Foundational Model Adaptation without Forgetting", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
  • SHIP: "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts", ICCV, 2023 (CAS). [Paper]
  • PTNL: "Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?", ICCV, 2023 (ByteDance). [Paper]
  • E2VPT: "E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning", ICCV, 2023 (Rochester Institute of Technology, NY). [Paper][PyTorch]
  • R-AMT: "Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
  • DiffTPT: "Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning", ICCV, 2023 (A*STAR). [Paper][PyTorch (in construction)]
  • KAPT: "Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models", ICCV, 2023 (Southern University of Science and Technology (SUSTech)). [Paper]
  • RPO: "Read-only Prompt Optimization for Vision-Language Few-shot Learning", ICCV, 2023 (Korea University). [Paper][PyTorch]
  • LoGoPrompt: "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models", ICCV, 2023 (ShanghaiTech). [Paper][Website]
  • DAPT: "Distribution-Aware Prompt Tuning for Vision-Language Models", ICCV, 2023 (Korea University). [Paper][Code (in construction)]
  • GOPro: "GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning", BMVC, 2023 (IIT Bombay). [Paper][Code (in construction)]
  • ALIGN: "Tuning Multi-mode Token-level Prompt Alignment across Modalities", NeurIPS, 2023 (Xidian University). [Paper][Code (in construction)]
  • GraphAdapter: "GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph", NeurIPS, 2023 (NUS). [Paper][Code (in construction)]
  • SeMap: "From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need", arXiv, 2023 (CISPA, Germany). [Paper]
  • R-Tuning: "R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • VPTM: "Rethinking Visual Prompt Learning as Masked Visual Token Modeling", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • GRAM: "Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models", arXiv, 2023 (Huawei). [Paper]
  • PBPrompt: "Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models", arXiv, 2023 (Xidian University). [Paper]
  • CTP-TFT: "Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models", arXiv, 2023 (Baidu). [Paper]
  • POMP: "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition", arXiv, 2023 (Amazon). [Paper][PyTorch]
  • ?: "What does CLIP know about a red circle? Visual prompt engineering for VLMs", arXiv, 2023 (Oxford). [Paper]
  • Robust-ProL: "Towards Robust Prompts on Vision-Language Models", arXiv, 2023 (Google). [Paper]
  • ProVP: "Progressive Visual Prompt Learning with Contrastive Feature Re-formation", arXiv, 2023 (vivo, China). [Paper]
  • ?: "Chain of Thought Prompt Tuning in Vision Language Models", arXiv, 2023 (Peking University). [Paper]
  • Instruction-ViT: "Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper]
  • VPGTrans: "Transfer Visual Prompt Generator across LLMs", arXiv, 2023 (NUS). [Paper][PyTorch][Website]
  • DRPT: "DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • VCoT: "Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings", arXiv, 2023 (UCSB). [Paper]
  • PMPO: "Multi-Prompt with Depth Partitioned Cross-Modal Learning", arXiv, 2023 (CAS). [Paper]
  • Aurora: "Mode Approximation Makes Good Vision-Language Prompts", arXiv, 2023 (Peking). [Paper][PyTorch]
  • DSD: "Discriminative Diffusion Models as Few-shot Vision and Language Learners", arXiv, 2023 (Google). [Paper]
  • PLID: "Prompting Language-Informed Distribution for Compositional Zero-Shot Learning", arXiv, 2023 (Michigan State). [Paper]
  • ConES: "ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models", arXiv, 2023 (Sichuan University). [Paper]
  • LaFTer: "LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections", arXiv, 2023 (TU Graz, Austria). [Paper]
  • ?: "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning", arXiv, 2023 (Brown). [Paper][PyTorch]
  • CoPrompt: "Consistency-guided Prompt Learning for Vision-Language Models", arXiv, 2023 (Queen’s University, Canada). [Paper]
  • ProTeCt: "ProTeCt: Prompt Tuning for Hierarchical Consistency", arXiv, 2023 (UCSD). [Paper]
  • FGVP: "Fine-Grained Visual Prompting", arXiv, 2023 (BAAI). [Paper]
  • POP: "POP: Prompt Of Prompts for Continual Learning", arXiv, 2023 (Qualcomm). [Paper]
  • GAVIE: "Aligning Large Multi-Modal Model with Robust Instruction Tuning", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • NPT: "Bridging the Gap: Neural Collapse Inspired Prompt Tuning for Generalization under Class Imbalance", arXiv, 2023 (Zhejiang University). [Paper]
  • APT: "Approximated Prompt Tuning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
  • CoPL: "Contextual Prompt Learning for Vision-Language Understanding", arXiv, 2023 (Adobe). [Paper]
  • CiP: "Image Captions are Natural Prompts for Text-to-Image Models", arXiv, 2023 (The University of Sydney). [Paper]
  • UP-DP: "UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models", arXiv, 2023 (Bosch). [Paper]
  • DPL: "DPL: Decoupled Prompt Learning for Vision-Language Models", arXiv, 2023 (vivo). [Paper]
  • DuAl-PT: "Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment", arXiv, 2023 (ByteDance). [Paper]
  • DePT: "DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning", arXiv, 2023 (UCL). [Paper][PyTorch]
  • Prompting4Debugging: "Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts", arXiv, 2023 (NYCU). [Paper]
  • ?: "Language Models as Black-Box Optimizers for Vision-Language Models", arXiv, 2023 (CMU). [Paper]
  • DePT: "DePT: Decoupled Prompt Tuning", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper][PyTorch]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", ACMMM, 2021 (Baidu). [Paper][Paddle]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • MGDoc: "MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding", EMNLP, 2022 (Adobe). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
  • Hi-VT5: "Hierarchical multimodal transformers for Multi-Page DocVQA", arXiv, 2022 (UAB, Spain). [Paper]
  • OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]
  • PIXEL: "Language Modelling with Pixels", ICLR, 2023 (University of Copenhagen, Denmark). [Paper]
  • Spotlight: "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus", ICLR, 2023 (Google). [Paper]
  • MaskDoc: "Masked Visual-Textual Prediction for Document Image Representation Pretraining", ICLR, 2023 (Baidu). [Paper]
  • StrucTexTv2: "StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training", ICLR, 2023 (Baidu). [Paper][Paddle]
  • FlexDM: "Towards Flexible Multi-modal Document Models", CVPR, 2023 (CyberAgent, Japan). [Paper][Tensorflow][Website]
  • MUI: "Mobile User Interface Element Detection Via Adaptively Prompt Tuning", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
  • UDOP: "Unifying Vision, Text, and Layout for Universal Document Processing", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • M6Doc: "M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis", CVPR, 2023 (South China University of Technology). [Paper][GitHub]
  • VGT: "Vision Grid Transformer for Document Layout Analysis", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • SeRum: "Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration", ICCV, 2023 (Tencent). [Paper]
  • DocTr: "DocTr: Document Transformer for Structured Information Extraction in Documents", ICCV, 2023 (Amazon). [Paper]
  • FormNetV2: "FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction", ACL, 2023 (Google). [Paper]
  • mmc4: "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text", arXiv, 2023 (AI2). [Paper][GitHub (in construction)]
  • DUBLIN: "DUBLIN - Document Understanding By Language-Image Network", arXiv, 2023 (Microsoft). [Paper]
  • DocFormerv2: "DocFormerv2: Local Features for Document Understanding", arXiv, 2023 (Amazon). [Paper]
  • DocumentCLIP: "DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents", arXiv, 2023 (Adobe). [Paper][PyTorch]
  • Kosmos-2.5: "Kosmos-2.5: A Multimodal Literate Model", arXiv, 2023 (Microsoft). [Paper]
  • UReader: "UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model", arXiv, 2023 (Alibaba). [Paper]
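
A recurring design across the document models above (the LayoutLM family, DocFormer, StrucTexT, and related work) is to embed each OCR token from both its word identity and its 2D bounding box, so the transformer reasons over page layout as well as text. The snippet below is a minimal, self-contained sketch of such a text-plus-layout embedding; the vocabulary size, hidden width, and coordinate binning are illustrative assumptions and do not reproduce any specific paper's configuration.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Embed OCR tokens from word ids plus normalized bounding boxes (0..1000)."""
    def __init__(self, vocab_size=30522, hidden=256, max_coord=1001):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        # Separate embeddings for the x and y coordinates of each token's box.
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, boxes):
        # token_ids: (B, L); boxes: (B, L, 4) as integer [x0, y0, x1, y1] in 0..1000
        x0, y0, x1, y1 = boxes.unbind(dim=-1)
        layout = (self.x_emb(x0) + self.y_emb(y0) +
                  self.x_emb(x1) + self.y_emb(y1))
        return self.norm(self.word_emb(token_ids) + layout)

# Toy usage: 2 "documents", 5 OCR tokens each, random words and boxes.
emb = LayoutAwareEmbedding()
token_ids = torch.randint(0, 30522, (2, 5))
boxes = torch.randint(0, 1001, (2, 5, 4))
tokens = emb(token_ids, boxes)                       # (2, 5, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2)
doc_features = encoder(tokens)                       # layout-aware token features
print(doc_features.shape)                            # torch.Size([2, 5, 256])
```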

[Back to Overview]

Other Multi-Modal Tasks

  • Transfer Learning/Adaptation/Distillation:
    • FLYP: "Finetune like you pretrain: Improved finetuning of zero-shot vision models", CVPR, 2023 (CMU). [Paper][PyTorch]
    • Pi-Tuning: "Pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation", ICML, 2023 (HKU). [Paper][Code (in construction)]
    • ORCA: "Cross-Modal Fine-Tuning: Align then Refine", ICML, 2023 (CMU + HP). [Paper][PyTorch]
    • TeS: "Improved Visual Fine-tuning with Natural Language Supervision", arXiv, 2023 (Alibaba). [Paper]
    • Paxion: "Paxion: Patching Action Knowledge in Video-Language Foundation Models", arXiv, 2023 (UIUC). [Paper][PyTorch]
    • RLCF: "Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • LMAT: "Can Large Pre-trained Models Help Vision Models on Perception Tasks?", arXiv, 2023 (Huawei). [Paper][Website (in construction)]
    • TaCA: "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • ProbVLM: "ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models", arXiv, 2023 (University of Tubingen, Germany). [Paper]
    • CLIP-KD: "CLIP-KD: An Empirical Study of Distilling CLIP Models", arXiv, 2023 (CAS). [Paper][Code (in construction)]
  • Zero-Shot:
    • CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (UW). [Paper][PyTorch]
    • SMs: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 (Google). [Paper][GitHub][Website]
    • iCLIP: "iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition", CVPR, 2023 (Microsoft). [Paper]
    • DiffDis: "DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability", ICCV, 2023 (Huawei). [Paper]
    • V-GLOSS: "Visually-Grounded Descriptions Improve Zero-Shot Image Classification", arXiv, 2023 (University of Alberta, Canada). [Paper]
    • ?: "Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness", arXiv, 2023 (Amazon). [Paper]
    • UniFine: "UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding", arXiv, 2023 (Columbia). [Paper][Code (in construction)]
    • Cheetah: "Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions", arXiv, 2023 (Zhejiang). [Paper]
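
The zero-shot entries above (e.g., CuPL and ZPE) center on building a per-class text classifier from several natural-language prompts and then ensembling or re-weighting those prompts. Below is a minimal sketch of zero-shot classification with weighted prompt ensembling over pre-computed features; the random tensors stand in for CLIP embeddings, and the helper name and weights are illustrative assumptions rather than any library's API.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, N_CLASSES, N_PROMPTS, N_IMAGES = 64, 5, 3, 4

# Stand-ins for pre-computed CLIP features (assumption: encoders already ran).
# text_feats[c, p] would be the embedding of prompt p for class c,
# e.g. "a photo of a {class}", "a sketch of a {class}", an LLM-written caption, ...
text_feats = F.normalize(torch.randn(N_CLASSES, N_PROMPTS, D), dim=-1)
image_feats = F.normalize(torch.randn(N_IMAGES, D), dim=-1)

def zero_shot_classifier(text_feats, prompt_weights=None):
    """Average (optionally weighted) prompt embeddings into one vector per class."""
    if prompt_weights is None:
        prompt_weights = torch.full((N_PROMPTS,), 1.0 / N_PROMPTS)
    w = prompt_weights / prompt_weights.sum()
    class_vecs = (text_feats * w.view(1, -1, 1)).sum(dim=1)   # (C, D)
    return F.normalize(class_vecs, dim=-1)

# Uniform ensembling vs. an illustrative hand-picked per-prompt weighting.
uniform = zero_shot_classifier(text_feats)
weighted = zero_shot_classifier(text_feats, torch.tensor([0.5, 0.3, 0.2]))

for name, classifier in [("uniform", uniform), ("weighted", weighted)]:
    logits = 100.0 * image_feats @ classifier.t()    # temperature-scaled cosine sim
    preds = logits.argmax(dim=-1)
    print(name, preds.tolist())
```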
  • X-Shot:
    • Tip-Adapter: "Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • ComCLIP: "ComCLIP: Training-Free Compositional Image and Text Matching", arXiv, 2022 (UC Santa Cruz). [Paper]
    • TCT: "Efficient Zero-shot Visual Search via Target and Context-aware Transformer", arXiv, 2022 (Baylor College of Medicine, TX). [Paper]
    • ?: "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", ICLR, 2023 (University of Amsterdam). [Paper]
    • ?: "Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models", CVPR, 2023 (CMU). [Paper]
    • SADA: "Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment", CVPR, 2023 (Huawei). [Paper][PyTorch]
    • APE: "Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LFA: "Black Box Few-Shot Adaptation for Vision-Language models", arXiv, 2023 (Samsung). [Paper]
    • ?: "Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime", arXiv, 2023 (DeepMind). [Paper]
    • Proto-CLIP: "Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning", arXiv, 2023 (UT Dallas). [Paper]
    • NtUA: "Noise-Tolerant Unsupervised Adapter for Vision-Language Models", arXiv, 2023 (MBZUAI). [Paper]
    • SeCAt: "Small Visual Language Models can also be Open-Ended Few-Shot Learners", arXiv, 2023 (UvA). [Paper]
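
Several of the few-shot entries above (Tip-Adapter, APE, Proto-CLIP, NtUA) avoid gradient training altogether: they build a key-value cache from the labeled support features and blend cache affinities with the zero-shot text classifier. The sketch below illustrates that training-free cache idea on random stand-in features; `alpha` and `beta` are illustrative hyper-parameters, and the code is a simplification rather than the official Tip-Adapter implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, N_CLASSES, K_SHOT, N_TEST = 64, 5, 4, 8

# Stand-ins for pre-extracted, L2-normalized CLIP features (assumption).
support_feats = F.normalize(torch.randn(N_CLASSES * K_SHOT, D), dim=-1)   # cache keys
support_labels = torch.arange(N_CLASSES).repeat_interleave(K_SHOT)
cache_values = F.one_hot(support_labels, N_CLASSES).float()               # cache values
text_classifier = F.normalize(torch.randn(N_CLASSES, D), dim=-1)          # zero-shot weights
test_feats = F.normalize(torch.randn(N_TEST, D), dim=-1)

alpha, beta = 1.0, 5.5   # blending weight and sharpness (illustrative values)

# Zero-shot logits from the frozen text classifier.
zs_logits = 100.0 * test_feats @ text_classifier.t()                      # (N_TEST, C)

# Cache logits: affinity to each support key, sharpened, then mapped to class scores.
affinity = test_feats @ support_feats.t()                                  # (N_TEST, C*K)
cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values          # (N_TEST, C)

logits = zs_logits + alpha * cache_logits
print("predictions:", logits.argmax(dim=-1).tolist())
```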
  • Referring Image Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • CRIS: "CRIS: CLIP-Driven Referring Image Segmentation", CVPR, 2022 (University of Sydney). [Paper]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
    • ReCLIP: "ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension", ACL, 2022 (AI2). [Paper]
    • TSEG: "Weakly-supervised segmentation of referring expressions", arXiv, 2022 (INRIA). [Paper]
    • ZS-RIS: "Zero-shot Referring Image Segmentation with Global-Local Context Features", CVPR, 2023 (Gwangju Institute of Science and Technology (GIST)). [Paper][PyTorch]
    • PolyFormer: "PolyFormer: Referring Image Segmentation as Sequential Polygon Generation", CVPR, 2023 (Amazon). [Paper][Website]
    • MCRES: "Meta Compositional Referring Expression Segmentation", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
    • ReLA: "GRES: Generalized Referring Expression Segmentation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CGFormer: "Contrastive Grouping With Transformer for Referring Image Segmentation", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
    • CCTF: "Learning To Segment Every Referring Object Point by Point", CVPR, 2023 (JD). [Paper][Code (in construction)]
    • ETRIS: "Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • DMMI: "Beyond One-to-One: Rethinking the Referring Image Segmentation", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • TRIS: "Referring Image Segmentation Using Text Supervision", ICCV, 2023 (CUHK). [Paper][Code (in construction)]
    • SnG: "Shatter and Gather: Learning Referring Image Segmentation with Text Supervision", ICCV, 2023 (POSTECH). [Paper]
    • VLT: "VLT: Vision-Language Transformer and Query Generation for Referring Segmentation", TPAMI, 2023 (NTU, Singapore). [Paper]
    • IREG: "Whether you can locate or not? Interactive Referring Expression Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
    • R-RIS: "Towards Robust Referring Image Segmentation", arXiv, 2023 (Peking). [Paper][Code (in construction)][Website]
    • PVD: "Parallel Vertex Diffusion for Unified Visual Grounding", arXiv, 2023 (Peking University). [Paper]
    • MMNet: "MMNet: Multi-Mask Network for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • LGFormer: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, 2023 (Alibaba). [Paper]
    • RISCLIP: "RISCLIP: Referring Image Segmentation Framework using CLIP", arXiv, 2023 (POSTECH). [Paper]
    • EAVL: "EAVL: Explicitly Align Vision and Language for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • Ref-Diff: "Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models", arXiv, 2023 (Harbin Institute of Technology). [Paper][Code (in construction)]
    • DuMoGa: "Towards Complex-query Referring Image Segmentation: A Novel Benchmark", arXiv, 2023 (NUS). [Paper]
  • Referring Video Segmentation:
    • ReferFormer: "Language as Queries for Referring Video Object Segmentation", CVPR, 2022 (HKU). [Paper][PyTorch]
    • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
    • MANet: "Multi-Attention Network for Compressed Video Referring Object Segmentation", ACMMM, 2022 (CAS). [Paper][PyTorch]
    • R2VOS: "Towards Robust Referring Video Object Segmentation with Cyclic Relational Consensus", ICCV, 2023 (Microsoft). [Paper][PyTorch][Website]
    • OnlineRefer: "OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation", ICCV, 2023 (Megvii). [Paper][PyTorch]
    • SgMg: "Spectrum-guided Multi-granularity Referring Video Object Segmentation", ICCV, 2023 (The University of Western Australia). [Paper][PyTorch]
    • MeViS: "MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CMA: "Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples", ICCV, 2023 (SUSTech). [Paper][PyTorch]
    • TempCD: "Temporal Collection and Distribution for Referring Video Object Segmentation", ICCV, 2023 (ShanghaiTech). [Paper][Website]
    • UniRef: "Segment Every Reference Object in Spatial and Temporal Spaces", ICCV, 2023 (HKU). [Paper]
    • HTML: "HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation", ICCV, 2023 (University of Technology Sydney, UTS). [Paper][Website]
    • SOC: "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation", NeurIPS, 2023 (Tsinghua). [Paper][Code (in construction)]
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", TPAMI, 2023 (Zhejiang). [Paper][PyTorch]
    • LoSh: "LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation", arXiv, 2023 (King’s College London). [Paper]
    • RefSAM: "RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation", arXiv, 2023 (National University of Defense Technology, China). [Paper][Code (in construction)]
    • IFIRVOS: "Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation", arXiv, 2023 (Wuhan University). [Paper]
    • LGCFS: "Learning Referring Video Object Segmentation from Weak Annotation", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • EPCFormer: "EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation", arXiv, 2023 (Hunan University). [Paper][Code (in construction)]
    • FTEA: "Fully Transformer-Equipped Architecture for End-to-End Referring Video Object Segmentation", arXiv, 2023 (Hangzhou Dianzi University). [Paper]
  • Referring 3D Segmentation:
    • 3D-STMN: "3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
    • TransRMOT: "Referring Multi-Object Tracking", CVPR, 2023 (Megvii). [Paper][PyTorch][Website]
    • ModaMixer: "Divert More Attention to Vision-Language Object Tracking", arXiv, 2023 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
    • ?: "Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding", CVPR, 2023 (Tel Aviv). [Paper][PyTorch][Website]
    • Why-Prompt: "Doubly Right Object Recognition: A Why Prompt for Visual Rationales", CVPR, 2023 (Columbia). [Paper]
    • CREPE: "CREPE: Can Vision-Language Foundation Models Reason Compositionally?", CVPR, 2023 (Stanford). [Paper]
    • ZOOM: "Zero-shot Model Diagnosis", CVPR, 2023 (CMU). [Paper]
    • ?: "On the Generalization of Multi-modal Contrastive Learning", ICML, 2023 (Peking). [Paper][PyTorch]
    • ?: "Learning Concise and Descriptive Attributes for Visual Recognition", ICCV, 2023 (UCSD). [Paper]
    • ?: "Interpreting CLIP's Image Representation via Text-Based Decomposition", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", NeurIPS, 2022 (Google). [Paper]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", ICLR, 2023 (AI2). [Paper][JAX][Website]
    • ImageBind: "ImageBind: One Embedding Space To Bind Them All", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • EgoT2: "Egocentric Video Task Translation", CVPR, 2023 (Meta). [Paper][Website]
    • VTAGML: "Vision Transformer Adapters for Generalizable Multitask Learning", ICCV, 2023 (EPFL). [Paper][Website]
    • CoCoCon: "Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models", arXiv, 2023 (AI2). [Paper][PyTorch][Website]
    • VisionLLM: "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ONE-PEACE: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities", arXiv, 2023 (Alibaba). [Paper][PyTorch (in construction)]
    • VideoLLM: "VideoLLM: Modeling Video Sequence with Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • i-Code-Studio: "i-Code Studio: A Configurable and Composable Framework for Integrative AI", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • Tag2Text: "Tag2Text: Guiding Vision-Language Model via Image Tagging", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • RAM: "Recognize Anything: A Strong Image Tagging Model", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • InstructDiffusion: "InstructDiffusion: A Generalist Modeling Interface for Vision Tasks", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • InstructCV: "InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists", arXiv, 2023 (Peking + Berkeley). [Paper][PyTorch]
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
    • Video-P2P: "Video-P2P: Video Editing with Cross-attention Control", arXiv, 2023 (CUHK). [Paper][Website]
    • FateZero: "FateZero: Fusing Attentions for Zero-shot Text-based Video Editing", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Make-A-Protagonist: "Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts", arXiv, 2023 (Huawei). [Paper][PyTorch][Website]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
    • QD-DETR: "Query-Dependent Video Representation for Moment Retrieval and Highlight Detection", CVPR, 2023 (Sungkyunkwan University, Korea). [Paper][PyTorch]
    • A2Summ: "Align and Attend: Multimodal Summarization with Dual Contrastive Losses", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CLC: "Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
    • VideoXum: "VideoXum: Cross-modal Visual and Textural Summarization of Videos", arXiv, 2023 (OPPO). [Paper][Website]
    • MH-DETR: "MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer", arXiv, 2023 (Nanjing University). [Paper]
    • VisionaryVid: "Joint Moment Retrieval and Highlight Detection Via Natural Language Queries", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
    • VLMbench: "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation", NeurIPS (Datasets and Benchmarks), 2022 (UC Santa Cruz). [Paper][PyTorch][Website]
    • Surgical-VQLA: "Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery", ICRA, 2023 (CUHK). [Paper][PyTorch]
    • ?: "Distilling Internet-Scale Vision-Language Models into Embodied Agents", ICML, 2023 (DeepMind). [Paper]
    • LIV: "LIV: Language-Image Representations and Rewards for Robotic Control", ICML, 2023 (UPenn). [Paper][PyTorch][Website]
    • PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 (Google). [Paper][Website]
    • VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • GVCCI: "GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation", IROS, 2023 (SNU, Korea). [Paper]
    • LACO: "Language-Conditioned Path Planning", CoRL, 2023 (Berkeley). [Paper][Code (in construction)][Website]
    • Grounded-Decoding: "Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control", arXiv, 2023 (Google). [Paper][Website]
    • MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Vision-Language Models as Success Detectors", arXiv, 2023 (DeepMind). [Paper]
    • VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", arXiv, 2023 (Meta). [Paper][Website]
    • HomeRobot: "HomeRobot: Open-Vocabulary Mobile Manipulation", arXiv, 2023 (Georgia Tech + Meta). [Paper][PyTorch][Website]
    • TaPA: "Embodied Task Planning with Large Language Models", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch][Website]
    • VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", arXiv, 2023 (Stanford). [Paper][Website]
    • RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv, 2023 (DeepMind). [Paper][Website]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
    • CDDFuse: "CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion", CVPR, 2023 (ETHZ). [Paper][PyTorch]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
    • PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
    • VL-SAT: "VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • LERF: "LERF: Language Embedded Radiance Fields", ICCV, 2023 (Berkeley). [Paper][Website]
    • ConceptFusion: "ConceptFusion: Open-set Multimodal 3D Mapping", arXiv, 2023 (MIT). [Paper][Website]
    • CG3D: "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition", arXiv, 2023 (JHU). [Paper][PyTorch][Website]
    • DiffCLIP: "DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification", arXiv, 2023 (Beijing Institute of Technology). [Paper]
    • LLM-Grounder: "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent", arXiv, 2023 (UMich). [Paper][PyTorch][Website]
  • 3D Scene Understanding:
    • OpenScene: "OpenScene: 3D Scene Understanding with Open Vocabularies", CVPR, 2023 (Google). [Paper][PyTorch][Website]
    • PartSLIP: "PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models", CVPR, 2023 (Qualcomm). [Paper]
    • CLIP2Scene: "CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
    • 3D-Highlighter: "3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions", CVPR, 2023 (University of Chicago). [Paper][PyTorch][Website]
    • OVSG: "Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs", CoRL, 2023 (Rutgers). [Paper][Code (in construction)]
    • CLIP-FO3D: "CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP", arXiv, 2023 (Tsinghua University). [Paper]
    • 3D-OVS: "3D Open-vocabulary Segmentation with Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • OVO: "OVO: Open-Vocabulary Occupancy", arXiv, 2023 (Fudan). [Paper]
    • SAM3D: "SAM3D: Segment Anything in 3D Scenes", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Seal: "Segment Any Point Cloud Sequences by Distilling Vision Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch (in construction)]
    • OpenMask3D: "OpenMask3D: Open-Vocabulary 3D Instance Segmentation", arXiv, 2023 (ETHZ). [Paper][Website (in construction)]
    • Lowis3D: "Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding", arXiv, 2023 (HKU). [Paper]
    • OpenIns3D: "OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation", arXiv, 2023 (Cambridge). [Paper][Website]
    • ConceptGraphs: "ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning", arXiv, 2023 (University of Toronto + Universite de Montreal). [Paper][PyTorch][Website]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
    • AVFormer: "AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR", CVPR, 2023 (Google). [Paper]
    • AV-RelScore: "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring", CVPR, 2023 (KAIST). [Paper][PyTorch]
    • SynthVSR: "SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision", CVPR, 2023 (Meta). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
    • DMD: "Decoupled Multimodal Distilling for Emotion Recognition", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • Sound Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
    • iQuery: "iQuery: Instruments as Queries for Audio-Visual Sound Separation", CVPR, 2023 (UCSD). [Paper][Code (in construction)]
    • VAST: "Language-Guided Audio-Visual Source Separation via Trimodal Consistency", CVPR, 2023 (Boston University). [Paper][Website]
    • AVIN: "Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization", ACMMM, 2023 (Northwestern Polytechnical University). [Paper][Code (in construction)]
    • GAVS: "Prompting Segmentation with Sound is Generalizable Audio-Visual Source Localizer", arXiv, 2023 (Renmin University of China). [Paper]
  • Audio-Visual:
    • AV-HuBERT: "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction", ICLR, 2022 (Meta). [Paper][PyTorch]
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
    • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
    • ANGIE: "Audio-Driven Co-Speech Gesture Video Generation", NeurIPS, 2022 (CUHK). [Paper][Website]
    • MGN: "Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing", NeurIPS, 2022 (CMU + UT Austin). [Paper][PyTorch]
    • FS-RIR: "Few-Shot Audio-Visual Learning of Environment Acoustics", NeurIPS, 2022 (UT Austin). [Paper][Website]
    • u-HuBERT: "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality", NeurIPS, 2022 (Meta). [Paper]
    • PC-VAE: "Multimodal Transformer for Parallel Concatenated Variational Autoencoders", NeurIPSW, 2022 (USC). [Paper]
    • AV-CAT: "Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers", SIGGRAPH Asia, 2022 (Tokyo Institute of Technology + Baidu). [Paper][Website]
    • Audiovisual-MAE: "Audiovisual Masked Autoencoders", arXiv, 2022 (Google). [Paper]
    • MTD: "Multimodal Transformer Distillation for Audio-Visual Synchronization", arXiv, 2022 (NTU). [Paper]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
    • CLIPSep: "CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos", ICLR, 2023 (Sony). [Paper]
    • CAV-MAE: "Contrastive Audio-Visual Masked Autoencoder", ICLR, 2023 (MIT + IBM). [Paper]
    • UnAV: "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline", CVPR, 2023 (Southern University of Science and Technology). [Paper][PyTorch][Website]
    • LAVISH: "Vision Transformers are Parameter-Efficient Audio-Visual Learners", CVPR, 2023 (UNC). [Paper][PyTorch][Website]
    • OneAVM: "A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition", ICML, 2023 (CMU + UW Madison). [Paper][Code (in construction)]
    • AdVerb: "AdVerb: Visually Guided Audio Dereverberation", ICCV, 2023 (Maryland). [Paper][Website]
    • CIGN: "Class-Incremental Grouping Network for Continual Audio-Visual Learning", ICCV, 2023 (CMU). [Paper][PyTorch]
    • MAViL: "MAViL: Masked Audio-Video Learners", NeurIPS, 2023 (Meta). [Paper][Code (in construction)]
    • GestureDiffuCLIP: "GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents", arXiv, 2023 (Peking University). [Paper]
    • MMViT: "MMViT: Multiscale Multiview Vision Transformers", arXiv, 2023 (Meta). [Paper]
    • ?: "Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos" arXiv, 2023 (Meta). [Paper]
  • Audio-Visual Localization/Segmentation:
    • AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
    • AV-SAM: "AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation", arXiv, 2023 (CMU + UT Dallas). [Paper]
    • AUSS: "Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation", arXiv, 2023 (Fudan). [Paper]
    • AuTR: "Annotation-free Audio-Visual Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • AVSegFormer: "AVSegFormer: Audio-Visual Segmentation with Transformer", arXiv, 2023 (Nanjing University). [Paper][PyTorch]
    • SQD: "Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition", arXiv, 2023 (CMU). [Paper]
    • DiffMAViL: "Diffusion Models as Masked Audio-Video Learners", arXiv, 2023 (Apple). [Paper]
  • Audio Description:
  • Sound Localization:
    • TURN: "Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch (in construction)]
    • AVGN: "Audio-Visual Grouping Network for Sound Localization from Mixtures", CVPR, 2023 (CMU). [Paper][PyTorch]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: A MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Named Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]
  • Object Captioning:
    • GRiT: "GRiT: A Generative Region-to-text Transformer for Object Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Conversation:
    • VisProg: "Visual Programming: Compositional visual reasoning without training", CVPR, 2023 (AI2). [Paper][PyTorch][Website]
    • LaVIN: "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models", NeurIPS, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • Visual-ChatGPT: "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models", arXiv, 2023 (Microsoft). [Paper]
    • MM-REACT: "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action", arXiv, 2023 (Microsoft). [Paper][Code][Website]
    • Video-ChatCaptioner: "Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions", arXiv, 2023 (KAUST). [Paper][PyTorch]
    • Chameleon: "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models", arXiv, 2023 (UCLA + Microsoft). [Paper][PyTorch][Website]
    • MiniGPT-4: "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv, 2023 (KAUST). [Paper][PyTorch][Website]
    • ChatVideo: "ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System", arXiv, 2023 (Fudan). [Paper][Website]
    • LLaMA-Adapter: "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LLaMA-Adapter-V2: "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Otter: "Otter: A Multi-Modal Model with In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • LMEye: "LMEye: An Interactive Perception Network for Large Language Models", arXiv, 2023 (Meituan). [Paper]
    • MultiModal-GPT: "MultiModal-GPT: A Vision and Language Model for Dialogue with Humans", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • InternChat: "InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • VideoChat: "VideoChat: Chat-Centric Video Understanding", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • InstructBLIP: "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", arXiv, 2023 (Salesforce). [Paper][PyTorch]
    • ArtGPT-4: "ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4", arXiv, 2023 (Anhui Polytechnic University). [Paper][PyTorch]
    • EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", arXiv, 2023 (HKU). [Paper][PyTorch (in construction)][Website]
    • PandaGPT: "PandaGPT: One Model To Instruction-Follow Them All", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Video-LLaMA: "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • MIMIC-IT: "MIMIC-IT: Multi-Modal In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • Video-ChatGPT: "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
    • LAMM: "LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • ?: "Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models", arXiv, 2023 (Huawei). [Paper]
    • AssistGPT: "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • Macaw-LLM: "Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • Shikra: "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic", arXiv, 2023 (SenseTime). [Paper][Code (in construction)]
    • LLaVAR: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding", arXiv, 2023 (Stanford). [Paper][PyTorch][Website]
    • Polite-Flamingo: "Visual Instruction Tuning with Polite Flamingo", arXiv, 2023 (Xiaobing.AI). [Paper]
    • Lynx: "What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?", arXiv, 2023 (ByteDance). [Paper][Website]
    • GPT4RoI: "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SVIT: "SVIT: Scaling up Visual Instruction Tuning", arXiv, 2023 (BAAI). [Paper]
    • AmadeusGPT: "AmadeusGPT: a natural language interface for interactive animal behavioral analysis", arXiv, 2023 (EPFL). [Paper][Code (in construction)]
    • ChatSpot: "ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning", arXiv, 2023 (Megvii). [Paper][Demo]
    • 3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", arXiv, 2023 (UCLA). [Paper][PyTorch (in construction)][Website]
    • ?: "How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges", arXiv, 2023 (ETHZ). [Paper][GitHub (in construction)]
    • MovieChat: "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • AntGPT: "AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?", arXiv, 2023 (Brown). [Paper][Website]
    • ?: "Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models", arXiv, 2023 (Google). [Paper]
    • MM-Vet: "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities", arXiv, 2023 (Microsoft). [Paper][Code]
    • Chat-3D: "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • LLaVA: "Visual Instruction Tuning", arXiv, 2023 (UW-Madison). [Paper][PyTorch][Website]
    • StableLLaVA: "StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • PVIT: "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
    • PointLLM: "PointLLM: Empowering Large Language Models to Understand Point Clouds", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • Point-Bind: "Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • ImageBind-LLM: "ImageBind-LLM: Multi-modality Instruction Tuning", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • ?: "An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models", arXiv, 2023 (Microsoft). [Paper][GitHub]
    • InternLM-XComposer: "InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LLaVA-RLHF: "Aligning Large Multimodal Models with Factually Augmented RLHF", arXiv, 2023 (Berkeley + CMU + UIUC). [Paper][Code (in construction)][Website]
    • AnyMAL: "AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model", arXiv, 2023 (Meta). [Paper]
    • Muffin: "Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants", arXiv, 2023 (Tsinghua). [Paper]
    • Pink: "Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs", arXiv, 2023 (Ant). [Paper][Code (in construction)]
    • LLaVA-1.5: "Improved Baselines with Visual Instruction Tuning", arXiv, 2023 (UW Madison). [Paper][PyTorch][Website]
    • MiniGPT-5: "MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
    • Ferret: "Ferret: Refer and Ground Anything Anywhere at Any Granularity", arXiv, 2023 (Apple). [Paper][Code (in construction)]
  • Visual Reasoning:
    • BDC-Adapter: "BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning", BMVC, 2023 (SUSTech). [Paper]
    • RPT: "Fine-Grained Regional Prompt Tuning for Visual Abductive Reasoning", arXiv, 2023 (A*STAR). [Paper]
    • LRR: "Look, Remember and Reason: Visual Reasoning with Grounded Rationales", arXiv, 2023 (Qualcomm). [Paper]
    • SDS-CLIP: "Augmenting CLIP with Improved Visio-Linguistic Reasoning", arXiv, 2023 (Maryland). [Paper]
    • ?: "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models", arXiv, 2023 (George Mason University). [Paper]
    • ViCor: "ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models", arXiv, 2023 (UC Santa Cruz). [Paper]
  • Tracking:
    • JointNLT: "Joint Visual Grounding and Tracking with Natural Language Specification", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • MMTrack: "Towards Unified Token Learning for Vision-Language Tracking", arXiv, 2023 (Guangxi Normal University). [Paper]
  • Scene Graph:
    • CaCao: "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World", arXiv, 2023 (Zhejiang University). [Paper]
  • Egocentric Video:
    • MMG-Ego4D: "MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition", CVPR, 2023 (Meta). [Paper]
    • EgoTV: "EgoTV: Egocentric Task Verification from Natural Language Task Descriptions", arXiv, 2023 (Meta). [Paper]
  • Dance Generation:
  • Conceptual Understanding:
    • ?: "Text-To-Concept (and Back) via Cross-Model Alignment", ICML, 2023 (Maryland). [Paper]
    • ?: "Probing Conceptual Understanding of Large Visual-Language Models", arXiv, 2023 (UCF + SRI). [Paper]
    • EAC: "Explain Any Concept: Segment Anything Meets Concept-Based Explanation", arXiv, 2023 (HKUST). [Paper]
  • Model Merging:
    • VL-merging: "An Empirical Study of Multimodal Model Merging", arXiv, 2023 (Microsoft). [Paper][PyTorch]
  • Visual Word Sense Disambiguation (VWSD):
    • CADG: "Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information", ACL, 2023 (UMass). [Paper]
  • Object Hallucination:
    • POPE: "Evaluating Object Hallucination in Large Vision-Language Models", arXiv, 2023 (Renmin University of China). [Paper][Code (in construction)]
  • Social Interaction:
    • HIINT: "HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer", arXiv, 2023 (MIT). [Paper]
  • Evaluation:
    • Perception-Test: "Perception Test: A Diagnostic Benchmark for Multimodal Video Models", arXiv, 2023 (DeepMind). [Paper][GitHub]
    • VLM-Probing: "Scalable Performance Analysis for Vision-Language Models", Joint Conference on Lexical and Computational Semantics (*SEM), 2023 (UMich). [Paper][PyTorch]
    • VisualGPTScore: "VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • LVLM-eHub: "LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
    • VisoGender: "VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution", arXiv, 2023 (Oxford). [Paper][PyTorch]
    • MME: "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • MMBench: "MMBench: Is Your Multi-modal Model an All-around Player?", arXiv, 2023 (Shanghai AI Lab). [Paper][Website]
    • Tiny-LVLM-eHub: "Tiny LVLM-eHub: Early Multimodal Experiments with Bard", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • VisIT-Bench: "VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use", arXiv, 2023 (UW). [Paper][Website]
    • MODE: "An Examination of the Compositionality of Large Generative Vision-Language Models", arXiv, 2023 (HKUST). [Paper]
    • TouchStone: "TouchStone: Evaluating Vision-Language Models by Language Models", arXiv, 2023 (Alibaba). [Paper]
    • Q-Bench: "Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision", arXiv, 2023 (NTU, Singapore). [Paper]
    • PCA-EVAL: "Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond", arXiv, 2023 (Peking). [Paper][Code (in construction)]
    • ReForm-Eval: "ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks", arXiv, 2023 (Fudan). [Paper]
  • Robustness:
    • Hierarchy-CLIP: "Improving Zero-shot Generalization and Robustness of Multi-modal Models", CVPR, 2023 (Google). [Paper][JAX][Website]
    • ?: "Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning", ICML, 2023 (UCLA). [Paper]
    • SGA: "Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models", ICCV, 2023 (Southern University of Science and Technology). [Paper]
    • VLAttack: "VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models", NeurIPS, 2023 (Pennsylvania State University). [Paper]
    • AttackVLM: "On Evaluating Adversarial Robustness of Large Vision-Language Models", arXiv, 2023 (Singapore University of Technology and Design (SUTD)). [Paper][PyTorch (in construction)]
  • Compositional Reasoning:
    • DAC: "Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models", arXiv, 2023 (IBM). [Paper]
    • SugarCrepe: "SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality", arXiv, 2023 (AI2). [Paper][PyTorch]
  • Vocabulary-free Image Classification (VIC):
    • CaSED: "Vocabulary-free Image Classification", arXiv, 2023 (University of Trento, Italy). [Paper][PyTorch]
  • Retrieval Augmented Methods:
    • ?: "Improving Image Recognition by Retrieving from Web-Scale Image-Text Data", CVPR, 2023 (Google). [Paper]
  • NeRF:
    • NeRDi: "NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors", CVPR, 2023 (Waymo). [Paper]
  • Model Selection:
    • LOVM: "LOVM: Language-Only Vision Model Selection", arXiv, 2023 (Stanford). [Paper]
  • Multimodal Interaction:
    • ?: "Learning Unseen Modality Interaction", arXiv, 2023 (University of Amsterdam). [Paper]
  • Multimodal Translation:
    • CLIPTrans: "CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation", ICCV, 2023 (Boston College). [Paper][PyTorch]
  • Noisy label detection:
    • VDC: "VDC: Versatile Data Cleanser for Detecting Dirty Samples via Visual-Linguistic Inconsistency", arXiv, 2023 (CUHK). [Paper]
  • Model Compression:
    • ECoFLaP: "ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models", arXiv, 2023 (UNC). [Paper][PyTorch][Website]

[Back to Overview]