Efficient-Transformers-in-Vision: A Survey

A detailed survey focusing on recent efficient transformers for mainstream computer vision tasks.

Comments and contributions are welcome!

Updated regularly.

Resource

Google Parti [Page], [Code]
Google Imagen [Page], [Paper]
DeepMind Gato: A Generalist Agent, [Paper]
Google PaLM: Scaling Language Modeling with Pathways, [Paper]
OpenAI DALL·E 2 [Page], [Paper]
SCENIC: A JAX Library for Computer Vision Research and Beyond, [Code]
V-L joint learning study (with good tables): [METER], [Kaleido-BERT]
Attention is all you need, [Paper]
OpenAI CLIP [Page], [Paper], [Code], [arXiv]
OpenAI DALL·E [Page], [Code], [Paper]
huggingface/transformers
Kyubyong/transformer, TF
jadore801120/attention-is-all-you-need-pytorch, Torch
krasserm/fairseq-image-captioning
PyTorch Transformers Tutorials
ictnlp/awesome-transformer
basicv8vc/awesome-transformer
dk-liang/Awesome-Visual-Transformer
yuewang-cuhk/awesome-vision-language-pretraining-papers

Survey

(arXiv 2022.06) Multimodal Learning with Transformers: A Survey, [Paper]
(arXiv 2022.05) Vision Transformer: Vit and its Derivatives, [Paper]
(arXiv 2022.05) Transformers in 3D Point Clouds: A Survey, [Paper]
(arXiv 2022.04) Visual Attention Methods in Deep Learning: An In-Depth Survey, [Paper]
(arXiv 2022.04) Vision-and-Language Pretrained Models: A Survey, [Paper]
(arXiv 2022.03) A Roadmap for Big Model, [Paper]
(arXiv 2022.03) Transformers Meet Visual Learning Understanding: A Comprehensive Review, [Paper](https://arxiv.org/pdf/2203.12944.pdf)
(arXiv 2022.03) Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, [Paper], [Project]
(arXiv 2022.02) A Survey of Vision-Language Pre-Trained Models, [Paper]
(arXiv 2022.02) VLP: A Survey on Vision-Language Pre-training, [Paper]
(arXiv 2022.02) Transformer for Graphs: An Overview from Architecture Perspective, [Paper]
(arXiv 2022.01) Video Transformers: A Survey, [Paper]
(arXiv 2021.11) ARE WE READY FOR A NEW PARADIGM SHIFT? A SURVEY ON VISUAL DEEP MLP, [Paper]
(arXiv 2021.11) A Survey of Visual Transformers, [Paper]
(arXiv 2021.09) Survey: Transformer based Video-Language Pre-training, [Paper]
(arXiv 2021.06) A Survey of Transformers, [Paper]
(arXiv 2021.06) Attention mechanisms and deep learning for machine vision: A survey of the state of the art, [Paper]
(arXiv 2021.06) Pre-Trained Models: Past, Present and Future, [Paper]
(arXiv 2021.05) Can Attention Enable MLPs To Catch Up With CNNs? [Paper]
(arXiv 2021.03) A Practical Survey on Faster and Lighter Transformers, [Paper]
(arXiv 2021.03) Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision, [Paper]
(arXiv 2021.01) A Survey on Visual Transformer, [Paper]
(arXiv 2020.09) Efficient Transformers: A Survey, [Paper]
(arXiv 2020.1) Transformers in Vision: A Survey, [Paper]

Recent Papers

2022.07

(arXiv 2022.07) Distance Matters in Human-Object Interaction Detection, [Paper]

2022.06

(arXiv 2022.06) (Efficient Transformer) Video2StyleGAN: Encoding Video in Latent Space for Manipulation, [Paper]
(arXiv 2022.06) Text-Driven Stylization of Video Objects, [Paper], [Project]
(arXiv 2022.06) Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization, [Paper], [Code]
(arXiv 2022.06) CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation, [Paper]
(arXiv 2022.06) Towards Adversarial Attack on Vision-Language Pre-training Models, [Paper]
(arXiv 2022.06) CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2022.06) VISUALIZING AND UNDERSTANDING SELF-SUPERVISED VISION LEARNING, [Paper], [Code]
(arXiv 2022.06) VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection, [Paper]
(arXiv 2022.06) Bear the Query in Mind: Visual Grounding with Query-conditioned Convolution, [Paper]
(arXiv 2022.06) DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection, [Paper]
(arXiv 2022.06) REVECA – Rich Encoder-decoder framework for Video Event CAptioner, [Paper], [Code]
(arXiv 2022.06) SAViR-T: Spatially Attentive Visual Reasoning with Transformers, [Paper]
(arXiv 2022.06) (Efficient Transformer) EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm, [Paper], [Code]
(arXiv 2022.06) DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations, [Paper]
(arXiv 2022.06) Capturing and Inferring Dense Full-Body Human-Scene Contact, [Paper], [Project]
(arXiv 2022.06) M&M Mix: A Multimodal Multiview Transformer Ensemble, [Paper]
(arXiv 2022.06) DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment, [Paper]
(arXiv 2022.06) Voxel-MAE: Masked Autoencoders for Pre-training Large-scale Point Clouds, [Paper], [Code]
(arXiv 2022.06) Global Context Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Counting Varying Density Crowds Through Density Guided Adaptive Selection CNN and Transformer Estimation, [Paper]
(arXiv 2022.06) One-stage Action Detection Transformer, [Paper]
(arXiv 2022.06) SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders, [Paper]
(arXiv 2022.06) (Efficient Transformer) TRANSFORMER-BASED MULTI-MODAL PROPOSAL AND RE-RANK FOR WIKIPEDIA IMAGE-CAPTION MATCHING, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) Vicinity Vision Transformer, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications, [Paper], [Code]
(arXiv 2022.06) Temporally Consistent Semantic Video Editing, [Paper]
(arXiv 2022.06) VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation, [Paper]
(arXiv 2022.06) MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, [Paper], [Project]
(arXiv 2022.06) IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes, [Paper], [Code]
(arXiv 2022.06) Backdoor Attacks on Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Rectify ViT Shortcut Learning by Visual Saliency, [Paper]
(arXiv 2022.06) Learning Using Privileged Information for Zero-Shot Action Recognition, [Paper]
(arXiv 2022.06) Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning, [Paper], [Code]
(arXiv 2022.06) CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer, [Paper], [Project]
(arXiv 2022.06) (Efficient Transformer) SimA: Simple Softmax-free Attention for Vision Transformers, [Paper], [Code]
(arXiv 2022.06) UNIFIED-IO: A UNIFIED MODEL FOR VISION, LANGUAGE, AND MULTI-MODAL TASKS, [Paper], [Project]
(arXiv 2022.06) VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix, [Paper], [Code]
(arXiv 2022.06) ReLER@ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022, [Paper]
(arXiv 2022.06) Video + CLIP Baseline for Ego4D Long-term Action Anticipation, [Paper], [Code]
(arXiv 2022.06) What makes domain generalization hard?, [Paper]
(arXiv 2022.06) SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos, [Paper], [Code]
(arXiv 2022.06) Disentangling visual and written concepts in CLIP, [Paper], [Project]
(arXiv 2022.06) Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos, [Paper]
(arXiv 2022.06) Patch-level Representation Learning for Self-supervised Vision Transformers, [Paper]
(arXiv 2022.06) Zero-Shot Video Question Answering via Frozen Bidirectional Language Models, [Paper], [Code]
(arXiv 2022.06) OmniMAE: Single Model Masked Pretraining on Images and Videos, [Paper], [Code]
(arXiv 2022.06) Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency, [Paper], [Code]
(arXiv 2022.06) LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling, [Paper], [Code]
(arXiv 2022.06) Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World, [Paper]
(arXiv 2022.06) Rethinking Generalization in Few-Shot Classification, [Paper], [Code]
(arXiv 2022.06) VCT: A Video Compression Transformer, [Paper]
(arXiv 2022.06) Forecasting of depth and ego-motion with transformers and self-supervision, [Paper]
(arXiv 2022.06) Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) SP-ViT: Learning 2D Spatial Priors for Vision Transformers, [Paper]
(arXiv 2022.06) A Simple Data Mixing Prior for Improving Self-Supervised Learning, [Paper], [Code]
(arXiv 2022.06) Prefix Language Models are Unified Modal Learners, [Paper], [Code]
(arXiv 2022.06) Masked Frequency Modeling for Self-Supervised Visual Pre-Training, [Paper], [Code](https://www.mmlab-ntu.com/project/mfm/index.html)
(arXiv 2022.06) Generalizable Neural Radiance Fields for Novel View Synthesis with Transformer, [Paper]
(arXiv 2022.06) A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training, [Paper]
(arXiv 2022.06) Learning to Estimate Shapley Values with Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Graph-based Spatial Transformer with Memory Replay for Multi-future Pedestrian Trajectory Prediction, [Paper], [Code]
(arXiv 2022.06) GLIPv2: Unifying Localization and VL Understanding, [Paper], [Code]
(arXiv 2022.06) INDIGO: Intrinsic Multimodality for Domain Generalization, [Paper]
(arXiv 2022.06) TRANSDUCTIVE CLIP WITH CLASS-CONDITIONAL CONTRASTIVE LEARNING, [Paper]
(arXiv 2022.06) SILVER-BULLET-3D AT MANISKILL 2021: LEARNING-FROM-DEMONSTRATIONS AND HEURISTIC RULE-BASED METHODS FOR OBJECT MANIPULATION, [Paper], [Code]
(arXiv 2022.06) MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing, [Paper], [Code]
(arXiv 2022.06) Visual Transformer for Object Detection, [Paper]
(arXiv 2022.06) Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens, [Paper], [Code]
(arXiv 2022.06) TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer, [Paper]
(arXiv 2022.06) ReCo: Retrieve and Co-segment for Zero-shot Transfer, [Paper], [Project]
(arXiv 2022.06) MAREO: MEMORY- AND ATTENTION- BASED VISUAL REASONING, [Paper]
(arXiv 2022.06) Recurrent Transformer Variational Autoencoders for Multi-Action Motion Synthesis, [Paper]
(arXiv 2022.06) Object Scene Representation Transformer, [Paper]
(arXiv 2022.06) Comprehending and Ordering Semantics for Image Captioning, [Paper], [Code]
(arXiv 2022.06) Exploring Adversarial Attacks and Defenses in Vision Transformers trained with DINO, [Paper]
(arXiv 2022.06) (Efficient Transformer) Peripheral Vision Transformer, [Paper], [Code]
(arXiv 2022.06) Efficient Decoder-free Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.06) Prototypical Contrastive Language Image Pretraining, [Paper], [Code]
(arXiv 2022.06) SpA-Former: Transformer image shadow detection and removal via spatial attention, [Paper], [Code]
(arXiv 2022.06) A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers, [Paper]
(arXiv 2022.06) Can Foundation Models Talk Causality? [Paper]
(arXiv 2022.06) Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space, [Paper], [Code]
(arXiv 2022.06) MaskViT: Masked Visual Pre-Training for Video Prediction, [Paper]
(arXiv 2022.06) PromptPose: Language Prompt Helps Animal Pose Estimation, [Paper]
(arXiv 2022.06) Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, [Paper]
(arXiv 2022.06) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.06) Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation, [Paper]
(arXiv 2022.06) Position Labels for Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Exploring Feature Self-relation for Self-supervised Transformer, [Paper]
(arXiv 2022.06) Patch-based Object-centric Transformers for Efficient Video Generation, [Paper], [Code]
(arXiv 2022.06) Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) VN-Transformer: Rotation-Equivariant Attention for Vector Neurons, [Paper]
(arXiv 2022.06) CLIP-Actor: Text-Driven Recommendation and Stylization for Animating Human Meshes, [Paper], [Code]
(arXiv 2022.06) OOD Augmentation May Be at Odds with Open-Set Recognition, [Paper]
(arXiv 2022.06) Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer, [Paper]
(arXiv 2022.06) (Efficient Transformer) cycle text2face: cycle text-to-face gan via transformers, [Paper]
(arXiv 2022.06) Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer, [Paper], [Code]
(arXiv 2022.06) Transformer based Urdu Handwritten Text Optical Character Reader, [Paper]
(arXiv 2022.06) Spatial Entropy Regularization for Vision Transformers, [Paper]
(arXiv 2022.06) On Data Scaling in Masked Image Modeling, [Paper]
(arXiv 2022.06) Extreme Masking for Learning Instance and Distributed Visual Representations, [Paper]
(arXiv 2022.06) GateHUB: Gated History Unit with Background Suppression for Online Action Detection, [Paper]
(arXiv 2022.06) Anomaly detection in surveillance videos using transformer based attention model, [Paper], [Code]
(arXiv 2022.06) ContraCLIP: Interpretable GAN generation driven by pairs of contrasting sentences, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) EAANet: Efficient Attention Augmented Convolutional Networks, [Paper]
(arXiv 2022.06) Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning, [Paper]
(arXiv 2022.06) (Efficient Transformer) Recurrent Video Restoration Transformer with Guided Deformable Attention, [Paper], [Code]
(arXiv 2022.06) Rethinking the Openness of CLIP, [Paper]
(arXiv 2022.06) OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression, [Paper]
(arXiv 2022.06) Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval, [Paper]
(arXiv 2022.06) CONTRASTIVE GRAPH MULTIMODAL MODEL FOR TEXT CLASSIFICATION IN VIDEOS, [Paper]
(arXiv 2022.06) (Efficient Transformer) Separable Self-attention for Mobile Vision Transformers, [Paper], [Code]
(arXiv 2022.06) Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation, [Paper], [Code]
(arXiv 2022.06) Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, [Paper]
(arXiv 2022.06) cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation, [Paper]
(arXiv 2022.06) Masked Unsupervised Self-training for Zero-shot Image Classification, [Paper], [Code]
(arXiv 2022.06) DETR++: Taming Your Multi-Scale Detection Transformer, [Paper]
(arXiv 2022.06) Structured Context Transformer for Generic Event Boundary Detection, [Paper]
(arXiv 2022.06) Revealing Single Frame Bias for Video-and-Language Learning, [Paper], [Code]
(arXiv 2022.06) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2022.06) Can CNNs Be More Robust Than Transformers? [Paper], [Code]
(arXiv 2022.06) Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding, [Paper]
(CVPR 2022) Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation, [Paper]
(arXiv 2022.06) A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge, [Paper], [Project]
(arXiv 2022.06) Revisiting the “Video” in Video-Language Understanding, [Paper], [Project]
(arXiv 2022.06) Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction, [Paper]
(arXiv 2022.06) Modeling Image Composition for Complex Scene Generation, [Paper], [Code]
(arXiv 2022.06) Unified Recurrence Modeling for Video Action Anticipation, [Paper]
(arXiv 2022.06) Prefix Conditioning Unifies Language and Label Supervision, [Paper]
(arXiv 2022.06) Optimizing Relevance Maps of Vision Transformers Improves Robustness, [Paper], [Code]
(arXiv 2022.06) VL-BEIT: Generative Vision-Language Pretraining, [Paper], [Code]
(arXiv 2022.06) (Efficient Transformer) EfficientFormer: Vision Transformers at MobileNet Speed, [Paper], [Code]
(arXiv 2022.06) REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering, [Paper]
(arXiv 2022.06) Siamese Image Modeling for Self-Supervised Vision Representation Learning, [Paper]
(CVPR 2022) Distillation Using Oracle Queries for Transformer-based Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection, [Paper], [Code]
(CVPR 2022) Human Trajectory Prediction with Momentary Observation, [Paper]
(arXiv 2022.06) (Efficient Transformer) Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer, [Paper]
(arXiv 2022.06) Unifying Voxel-based Representation with Transformer for 3D Object Detection, [Paper], [Code]
(arXiv 2022.06) Extreme Floorplan Reconstruction by Structure-Hallucinating Transformer Cascades, [Paper]
(arXiv 2022.06) Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training, [Paper]
(arXiv 2022.06) VALHALLA: Visual Hallucination for Machine Translation, [Paper], [Code]
(arXiv 2022.06) Learning Sequential Contexts using Transformer for 3D Hand Pose Estimation, [Paper]
(arXiv 2022.06) CLIP4IDC: CLIP for Image Difference Captioning, [Paper], [Code]
(arXiv 2022.06) Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment, [Paper]
(arXiv 2022.06) Vision GNN: An Image is Worth Graph of Nodes, [Paper], [Code]
(arXiv 2022.06) Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction, [Paper], [Code]
(arXiv 2022.06) TubeFormer-DeepLab: Video Mask Transformer, [Paper]
(arXiv 2022.06) Video-based Human-Object Interaction Detection from Tubelet Tokens, [Paper]

2022.05

(arXiv 2022.05) HeatER: An Efficient and Unified Network for Human Reconstruction via Heatmap-based TransformER, [Paper]
(arXiv 2022.05) Robotic grasp detection based on Transformer, [Paper]
(arXiv 2022.05) Multimodal Masked Autoencoders Learn Transferable Representations, [Paper]
(arXiv 2022.05) Multimodal Fake News Detection via CLIP-Guided Learning, [Paper]
(arXiv 2022.05) WT-MVSNet: Window-based Transformers for Multi-view Stereo, [Paper]
(arXiv 2022.05) Object-wise Masked Autoencoders for Fast Pre-training, [Paper]
(arXiv 2022.05) A Closer Look at Self-supervised Lightweight Vision Transformers, [Paper]
(arXiv 2022.05) Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning, [Paper]
(arXiv 2022.05) CYCLIP: Cyclic Contrastive Language-Image Pretraining, [Paper], [Code]
(arXiv 2022.05) MDMLP: Image Classification from Scratch on Small Datasets with MLP, [Paper], [Code]
(arXiv 2022.05) SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners, [Paper], [Code]
(arXiv 2022.05) 3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction, [Paper]
(arXiv 2022.05) Prompt-aligned Gradient for Prompt Tuning, [Paper], [Code]
(arXiv 2022.05) Illumination Adaptive Transformer, [Paper], [Code]
(arXiv 2022.05) HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling, [Paper]
(arXiv 2022.05) GMML is All you Need, [Paper], [Code]
(arXiv 2022.05) COMPLETEDT: POINT CLOUD COMPLETION WITH DENSE AUGMENT INFERENCE TRANSFORMERS, [Paper]
(arXiv 2022.05) Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks, [Paper]
(arXiv 2022.05) VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models, [Paper], [Benchmark], [Code]
(arXiv 2022.05) Architecture-Agnostic Masked Image Modeling – From ViT back to CNN, [Paper]
(arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]
(arXiv 2022.05) GIT: A Generative Image-to-text Transformer for Vision and Language, [Paper]
(arXiv 2022.05) 3DILG: Irregular Latent Grids for 3D Generative Modeling, [Paper]
(arXiv 2022.05) Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos, [Paper], [Code]
(arXiv 2022.05) Future Transformer for Long-term Action Anticipation, [Paper], [Project]
(arXiv 2022.05) X-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]
(arXiv 2022.05) Dynamic Query Selection for Fast Visual Perceiver, [Paper]
(arXiv 2022.05) MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers, [Paper]
(arXiv 2022.05) PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models, [Paper], [Code]
(arXiv 2022.05) Supporting Vision-Language Model Inference with Causality-pruning Knowledge Prompt, [Paper]
(arXiv 2022.05) Super Vision Transformer, [Paper], [Code]
(arXiv 2022.05) mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections, [Paper]
(arXiv 2022.05) VQA-GNN: Reasoning with Multimodal Semantic Graph for Visual Question Answering, [Paper]
(arXiv 2022.05) UMSNet: An Universal Multi-sensor Network for Human Activity Recognition, [Paper]
(arXiv 2022.05) Privacy-Preserving Image Classification Using Vision Transformer, [Paper]
(arXiv 2022.05) HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval, [Paper]
(arXiv 2022.05) ASSET: Autoregressive Semantic Scene Editing with Transformers at High Resolutions, [Paper], [Code]
(arXiv 2022.05) HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding, [Paper]
(arXiv 2022.05) Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning, [Paper]
(arXiv 2022.05) Degradation-Aware Unfolding Half-Shuffle Transformer for Spectral Compressive Imaging, [Paper]
(arXiv 2022.05) Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality, [Paper], [Code]
(arXiv 2022.05) Visual Concepts Tokenization, [Paper]
(arXiv 2022.05) MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion, [Paper]
(arXiv 2022.05) CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, [Paper], [Code]
(arXiv 2022.05) Evidence for Hypodescent in Visual Semantic AI, [Paper]
(arXiv 2022.05) Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer, [Paper], [Code]
(arXiv 2022.05) muNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-tuning Multitask Systems, [Paper]
(arXiv 2022.05) Large Language Models are Zero-Shot Reasoners, [Paper]
(arXiv 2022.05) AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition, [Paper], [Code]
(arXiv 2022.05) Green Hierarchical Vision Transformer for Masked Image Modeling, [Paper], [Code]
(arXiv 2022.05) Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation, [Paper]
(arXiv 2022.05) Cross-Architecture Self-supervised Video Representation Learning, [Paper], [Code]
(arXiv 2022.05) Prompt-based Learning for Unpaired Image Captioning, [Paper]
(arXiv 2022.05) MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning, [Paper], [Code]
(arXiv 2022.05) Fast Vision Transformers with HiLo Attention, [Paper], [Code]
(arXiv 2022.05) Fine-grained Image Captioning with CLIP Reward, [Paper], [Code]
(arXiv 2022.05) Mutual Information Divergence: A Unified Metric for Multimodal Generative Models, [Paper]
(arXiv 2022.05) MoCoViT: Mobile Convolutional Vision Transformer, [Paper]
(arXiv 2022.05) AO2-DETR: Arbitrary-Oriented Object Detection Transformer, [Paper]
(arXiv 2022.05) Inception Transformer, [Paper], [Code]
(arXiv 2022.05) VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation, [Paper]
(arXiv 2022.05) UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes, [Paper]
(arXiv 2022.05) Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, [Paper], [Code]
(arXiv 2022.05) Training Vision-Language Transformers from Captions Alone, [Paper], [Code]
(arXiv 2022.05) Voxel-informed Language Grounding, [Paper], [Code]
(arXiv 2022.05) Cross-Enhancement Transformer for Action Segmentation, [Paper]
(arXiv 2022.05) TRT-ViT: TensorRT-oriented Vision Transformer, [Paper]
(arXiv 2022.05) Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection, [Paper]
(arXiv 2022.05) A graph-transformer for whole slide image classification, [Paper]
(arXiv 2022.05) VNT-Net: Rotational Invariant Vector Neuron Transformers, [Paper]
(arXiv 2022.05) Masked Image Modeling with Denoising Contrast, [Paper]
(arXiv 2022.05) Cross-subject Action Unit Detection with Meta Learning and Transformer-based Relation Modeling, [Paper]
(arXiv 2022.05) Masked Autoencoders As Spatiotemporal Learners, [Paper]
(arXiv 2022.05) BodyMap: Learning Full-Body Dense Correspondence Map, [Paper], [Code]
(arXiv 2022.05) Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers, [Paper]
(arXiv 2022.05) AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars, [Paper]
(arXiv 2022.05) Vision Transformer Adapter for Dense Predictions, [Paper], [Code]
(arXiv 2022.05) Demo: Real-Time Semantic Communications with a Vision Transformer, [Paper]
(arXiv 2022.05) MulT: An End-to-End Multitask Learning Transformer, [Paper], [Code]
(arXiv 2022.05) A CLIP-Hitchhiker’s Guide to Long Video Retrieval, [Paper]
(arXiv 2022.05) Video Frame Interpolation with Transformer, [Paper], [Code]
(arXiv 2022.05) Dense residual Transformer for Image Denoising, [Paper]
(arXiv 2022.05) Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT, [Paper]
(arXiv 2022.05) Robot Cooking with Stir-fry: Bimanual Non-prehensile Manipulation of Semi-fluid Objects, [Paper]
(arXiv 2022.05) Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos, [Paper], [Code]
(arXiv 2022.05) Learning to Retrieve Videos by Asking Questions, [Paper]
(arXiv 2022.05) One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code, [Paper]
(arXiv 2022.05) Simple Open-Vocabulary Object Detection with Vision Transformers, [Paper], [Code]
(arXiv 2022.05) AggPose: Deep Aggregation Vision Transformer for Infant Pose Estimation, [Paper], [Code]
(arXiv 2022.05) An Empirical Study of Self-supervised Learning Approaches for Object Detection with Transformers, [Paper], [Code-DETR], [Code-Deform-DETR]
(arXiv 2022.05) Reduce Information Loss in Transformers for Pluralistic Image Inpainting, [Paper], [Code]
(arXiv 2022.05) Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training, [Paper]
(arXiv 2022.05) Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild, [Paper]
(arXiv 2022.05) Generalizable Task Planning through Representation Pretraining, [Paper], [Project]
(arXiv 2022.05) EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers, [Paper]
(arXiv 2022.05) Activating More Pixels in Image Super-Resolution Transformer, [Paper], [Code]
(arXiv 2022.05) Row-wise Accelerator for Vision Transformer, [Paper]
(arXiv 2022.05) SparseTT: Visual Tracking with Sparse Transformers, [Paper], [Code]
(arXiv 2022.05) RoViST: Learning Robust Metrics for Visual Storytelling, [Paper], [Code]
(arXiv 2022.05) Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection, [Paper]
(arXiv 2022.05) Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering, [Paper]
(arXiv 2022.05) Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning, [Paper]
(arXiv 2022.05) ConvMAE: Masked Convolution Meets Masked Autoencoders, [Paper], [Code]
(arXiv 2022.05) Cross-lingual Adaptation for Recipe Retrieval with Mixup, [Paper]
(arXiv 2022.05) Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework, [Paper]
(arXiv 2022.05) Transformer Tracking with Cyclic Shifting Window Attention, [Paper], [Code]
(arXiv 2022.05) Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning, [Paper]
(arXiv 2022.05) Prompt Distribution Learning, [Paper]
(arXiv 2022.05) CLIP-CLOP: CLIP-Guided Collage and Photomontage, [Paper]
(arXiv 2022.05) Dual-Level Decoupled Transformer for Video Captioning, [Paper]
(arXiv 2022.05) Declaration-based Prompt Tuning for Visual Question Answering, [Paper], [Code]
(arXiv 2022.05) P^3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision, [Paper]
(arXiv 2022.05) Language Models Can See: Plugging Visual Controls in Text Generation, [Paper], [Code]
(arXiv 2022.05) YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression, [Paper]
(arXiv 2022.05) Cross-view Transformers for real-time Map-view Semantic Segmentation, [Paper], [Code]
(arXiv 2022.05) i-Code: An Integrative and Composable Multimodal Learning Framework, [Paper]
(arXiv 2022.05) Visual Commonsense in Pretrained Unimodal and Multimodal Models, [Paper], [Project]
(arXiv 2022.05) Dual Cross-Attention Learning for Fine-Grained Visual Categorization and Object Re-Identification, [Paper]
(arXiv 2022.05) RecipeSnap - a lightweight image to recipe model, [Paper], [Code]
(arXiv 2022.05) CoCa: Contrastive Captioners are Image-Text Foundation Models, [Paper]
(arXiv 2022.05) Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP), [Paper]
(arXiv 2022.05) Cross-modal Representation Learning for Zero-shot Action Recognition, [Paper], [Code]
(arXiv 2022.05) Cross-Domain Object Detection with Mean-Teacher Transformer, [Paper]
(arXiv 2022.05) Better plain ViT baselines for ImageNet-1k, [Paper], [Code]
(arXiv 2022.05) Reinforced Swin-Convs Transformer for Underwater Image Enhancement, [Paper]
(arXiv 2022.05) UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog, [Paper]
(arXiv 2022.05) Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering, [Paper]
(arXiv 2022.05) CenterCLIP: Token Clustering for Efficient Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.05) Arbitrary Shape Text Detection via Boundary Transformer, [Paper], [Code]
(arXiv 2022.05) HULC: 3D Human Motion Capture with Pose Manifold Sampling and Dense Contact Guidance, [Paper], [Project]

2022.04

(arXiv 2022.04) Learn to Understand Negation in Video Retrieval, [Paper]
(arXiv 2022.04) LayoutBERT: Masked Language Layout Model for Object Insertion, [Paper]
(arXiv 2022.04) Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, [Paper], [Code]
(arXiv 2022.04) Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer, [Paper]
(arXiv 2022.04) SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation, [Paper]
(arXiv 2022.04) Where in the World is this Image? Transformer-based Geo-localization in the Wild, [Paper]
(arXiv 2022.04) Depth Estimation with Simplified Transformer, [Paper]
(arXiv 2022.04) A very preliminary analysis of DALL-E 2, [Paper]
(arXiv 2022.04) CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers, [Paper], [Code]
(arXiv 2022.04) CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification, [Paper], [Code]
(arXiv 2022.04) TEMOS: Generating diverse human motions from textual descriptions, [Paper], [Project]
(arXiv 2022.04) PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining, [Paper]
(arXiv 2022.04) Symmetric Transformer-based Network for Unsupervised Image Registration, [Paper], [Code]
(arXiv 2022.04) Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos, [Paper], [Code]
(arXiv 2022.04) CapOnImage: Context-driven Dense-Captioning on Image, [Paper]
(arXiv 2022.04) Self-Supervised Learning of Object Parts for Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]
(arXiv 2022.04) CATrans: Context and Affinity Transformer for Few-Shot Segmentation, [Paper]
(arXiv 2022.04) Self-Driving Car Steering Angle Prediction: Let Transformer Be a Car Again, [Paper], [Code]
(arXiv 2022.04) ClothFormer: Taming Video Virtual Try-on in All Module, [Paper]
(arXiv 2022.04) Deeper Insights into ViTs Robustness towards Common Corruptions, [Paper]
(arXiv 2022.04) VITPOSE: SIMPLE VISION TRANSFORMER BASELINES FOR HUMAN POSE ESTIMATION, [Paper], [Code]
(arXiv 2022.04) Understanding The Robustness in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval, [Paper]
(arXiv 2022.04) Contrastive Language-Action Pre-training for Temporal Localization, [Paper]
(arXiv 2022.04) Boosting Adversarial Transferability of MLP-Mixer, [Paper]
(arXiv 2022.04) Adaptive Split-Fusion Transformer, [Paper], [Code]
(arXiv 2022.04) Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?, [Paper], [Project]
(arXiv 2022.04) RELVIT: CONCEPT-GUIDED VISION TRANSFORMER FOR VISUAL RELATIONAL REASONING, [Paper]
(arXiv 2022.04) VISTA: Vision Transformer enhanced by U-Net and Image Colorfulness Frame Filtration for Automatic Retail Checkout, [Paper], [Code]
(arXiv 2022.04) CLIP-DISSECT: AUTOMATIC DESCRIPTION OF NEURON REPRESENTATIONS IN DEEP VISION NETWORKS, [Paper]
(arXiv 2022.04) Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers, [Paper]
(arXiv 2022.04) SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images, [Paper], [Code]
(arXiv 2022.04) OCFormer: One-Class Transformer Network for Image Classification, [Paper]
(arXiv 2022.04) DRT: A Lightweight Single Image Deraining Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering, [Paper], [Code]
(arXiv 2022.04) ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer, [Paper]
(arXiv 2022.04) iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition, [Paper], [Code]
(arXiv 2022.04) DIVERSE INSTANCE DISCOVERY: VISION-TRANSFORMER FOR INSTANCE-AWARE MULTI-LABEL IMAGE RECOGNITION, [Paper]
(arXiv 2022.04) Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds, [Paper], [Code]
(arXiv 2022.04) DFAM-DETR: Deformable feature based attention mechanism DETR on slender object detection, [Paper]
(arXiv 2022.04) NFormer: Robust Person Re-identification with Neighbor Transformer, [Paper], [Code]
(arXiv 2022.04) Video Moment Retrieval from Text Queries via Single Frame Annotation, [Paper]
(arXiv 2022.04) GIMO: Gaze-Informed Human Motion Prediction in Context, [Paper]
(arXiv 2022.04) VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance, [Paper]
(arXiv 2022.04) Sim-2-Sim Transfer for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2022.04) Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer, [Paper], [Code]
(arXiv 2022.04) Multimodal Token Fusion for Vision Transformers, [Paper]
(arXiv 2022.04) Self-Calibrated Efficient Transformer for Lightweight Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Searching Intrinsic Dimensions of Vision Transformers, [Paper]
(arXiv 2022.04) Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks, [Paper]
(arXiv 2022.04) Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting, [Paper]
(arXiv 2022.04) Multi-Frame Self-Supervised Depth with Transformers, [Paper], [Code]
(arXiv 2022.04) MST++: Multi-stage Spectral-wise Transformer for Efficient Spectral Reconstruction, [Paper], [Code]
(arXiv 2022.04) Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis, [Paper], [Code]
(arXiv 2022.04) An Extendable, Efficient and Effective Transformer-based Object Detector, [Paper], [Code]
(arXiv 2022.04) VDTR: Video Deblurring with Transformer, [Paper], [Code]
(arXiv 2022.04) BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment, [Paper], [Code]
(arXiv 2022.04) Temporally Efficient Vision Transformer for Video Instance Segmentation, [Paper], [Code]
(arXiv 2022.04) VSA: Learning Varied-Size Window Attention in Vision Transformers, [Paper], [Code]
(arXiv 2022.04) XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding, [Paper]
(arXiv 2022.04) IMPROVING CROSS-MODAL UNDERSTANDING IN VISUAL DIALOG VIA CONTRASTIVE LEARNING, [Paper]
(arXiv 2022.04) MVSTER: Epipolar Transformer for Efficient Multi-View Stereo, [Paper], [Code]
(arXiv 2022.04) UNCONDITIONAL IMAGE-TEXT PAIR GENERATION WITH MULTIMODAL CROSS QUANTIZER, [Paper]
(arXiv 2022.04) Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference, [Paper]
(arXiv 2022.04) COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval, [Paper]
(arXiv 2022.04) Image Captioning In the Transformer Age, [Paper], [Code]
(arXiv 2022.04) ResT V2: Simpler, Faster and Stronger, [Paper], [Code]
(arXiv 2022.04) Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer, [Paper], [Code]
(arXiv 2022.04) Temporal Progressive Attention for Early Action Prediction, [Paper], [Code]
(arXiv 2022.04) Keep the Caption Information: Preventing Shortcut Learning in Contrastive Image-Caption Retrieval, [Paper]
(arXiv 2022.04) Flamingo: a Visual Language Model for Few-Shot Learning, [Paper]
(arXiv 2022.04) Unsupervised Human Action Recognition with Skeletal Graph Laplacian and Self-Supervised Viewpoints Invariance, [Paper], [Code]
(arXiv 2022.04) Learning Future Object Prediction with a Spatiotemporal Detection Transformer, [Paper]
(arXiv 2022.04) R^2-Trans: Fine-Grained Visual Categorization with Redundancy Reduction, [Paper], [Code]
(arXiv 2022.04) A New Dataset and Transformer for Stereoscopic Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization, [Paper]
(arXiv 2022.04) Multi-Scale Features and Parallel Transformers Based Image Quality Assessment, [Paper], [Code]
(arXiv 2022.04) BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training, [Paper]
(arXiv 2022.04) Human-Object Interaction Detection via Disentangled Transformer, [Paper]
(arXiv 2022.04) ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models, [Paper]
(arXiv 2022.04) Interactiveness Field in Human-Object Interactions, [Paper], [Code]
(arXiv 2022.04) DeiT III: Revenge of the ViT, [Paper]
(arXiv 2022.04) Residual Swin Transformer Channel Attention Network for Image Demosaicing, [Paper]
(arXiv 2022.04) Neighborhood Attention Transformer, [Paper], [Code]
(arXiv 2022.04) MiniViT: Compressing Vision Transformers with Weight Multiplexing, [Paper], [Code]
(arXiv 2022.04) ViTOL: Vision Transformer for Weakly Supervised Object Localization, [Paper], [Code]
(arXiv 2022.04) What Matters in Language Conditioned Robotic Imitation Learning, [Paper], [Code]
(arXiv 2022.04) Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes, [Paper]
(arXiv 2022.04) ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension, [Paper]
(arXiv 2022.04) Are Multimodal Transformers Robust to Missing Modality?, [Paper]
(arXiv 2022.04) TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation, [Paper], [Code]
(arXiv 2022.04) X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks, [Paper]
(arXiv 2022.04) Event Transformer, [Paper]
(arXiv 2022.04) Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels, [Paper]
(arXiv 2022.04) ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation, [Paper], [Code]
(arXiv 2022.04) Multimodal Transformer for Nursing Activity Recognition, [Paper], [Code]
(arXiv 2022.04) Robust Cross-Modal Representation Learning with Progressive Self-Distillation, [Paper]
(arXiv 2022.04) Stripformer: Strip Transformer for Fast Image Deblurring, [Paper]
(arXiv 2022.04) No Token Left Behind: Explainability-Aided Image Classification and Generation, [Paper]
(arXiv 2022.04) Fashionformer: A Simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition, [Paper], [Code]
(arXiv 2022.04) Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation, [Paper], [Code]
(arXiv 2022.04) DILEMMA: Self-Supervised Shape and Texture Learning with Transformers, [Paper]
(arXiv 2022.04) Learning Trajectory-Aware Transformer for Video Super-Resolution, [Paper], [Code]
(arXiv 2022.04) Learning to Induce Causal Structure, [Paper]
(arXiv 2022.04) Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection, [Paper], [Code]
(arXiv 2022.04) Category-Aware Transformer Network for Better Human-Object Interaction Detection, [Paper]
(arXiv 2022.04) Does Robustness on ImageNet Transfer to Downstream Tasks?, [Paper]
(arXiv 2022.04) POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition, [Paper], [Code]
(arXiv 2022.04) Vision Transformers for Single Image Dehazing, [Paper], [Code]
(arXiv 2022.04) Underwater Image Enhancement Using Pre-trained Transformer, [Paper]
(arXiv 2022.04) Event Transformer. A sparse-aware solution for efficient event data processing, [Paper], [Code]
(arXiv 2022.04) PSTR: End-to-End One-Step Person Search With Transformers, [Paper], [Code]
(arXiv 2022.04) Adapting CLIP For Phrase Localization Without Further Training, [Paper], [Code]
(arXiv 2022.04) FineDiving: A Fine-grained Dataset for Procedure-aware Action Quality Assessment, [Paper], [Project]
(arXiv 2022.04) DaViT: Dual Attention Vision Transformers, [Paper], [Code]
(arXiv 2022.04) Unsupervised Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.04) Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer, [Paper], [Project]
(arXiv 2022.04) Unified Contrastive Learning in Image-Text-Label Space, [Paper], [Code]
(arXiv 2022.04) HunYuan_tvr for Text-Video Retrivial, [Paper]
(arXiv 2022.04) LEARNING TO COMPOSE SOFT PROMPTS FOR COMPOSITIONAL ZERO-SHOT LEARNING, [Paper]
(arXiv 2022.04) End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.04) Temporal Alignment Networks for Long-term Video, [Paper], [Code]
(arXiv 2022.04) Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, [Paper], [Code]
(arXiv 2022.04) MixFormer: Mixing Features across Windows and Dimensions, [Paper], [Code]
(arXiv 2022.04) CM3: A CAUSAL MASKED MULTIMODAL MODEL OF THE INTERNET, [Paper]
(arXiv 2022.04) DO AS I CAN, NOT AS I SAY: GROUNDING LANGUAGE IN ROBOTIC AFFORDANCES, [Paper], [Project]
(arXiv 2022.04) TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization, [Paper], [Code]
(arXiv 2022.04) Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, [Paper], [Project]
(arXiv 2022.04) Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition, [Paper]
(arXiv 2022.04) Learning Audio-Video Modalities from Image Captions, [Paper]
(arXiv 2022.04) Improving Vision Transformers by Revisiting High-frequency Components, [Paper]
(arXiv 2022.04) POS-BERT: Point Cloud One-Stage BERT Pre-Training, [Paper], [Code]
(arXiv 2022.04) BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation, [Paper], [Code]
(arXiv 2022.04) BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning, [Paper]
(arXiv 2022.04) TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting, [Paper]
(arXiv 2022.04) Long Movie Clip Classification with State-Space Video Models, [Paper], [Code]
(arXiv 2022.04) TALLFormer: Temporal Action Localization with Long-memory Transformer, [Paper], [Code]
(arXiv 2022.04) MultiMAE: Multi-modal Multi-task Masked Autoencoders, [Paper], [Project]
(arXiv 2022.04) “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations, [Paper]
(arXiv 2022.04) SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space, [Paper]
(arXiv 2022.04) Multi-View Transformer for 3D Visual Grounding, [Paper], [Code]
(arXiv 2022.04) VISION TRANSFORMER EQUIPPED WITH NEURAL RESIZER ON FACIAL EXPRESSION RECOGNITION TASK, [Paper]
(arXiv 2022.04) Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition, [Paper], [Project]
(arXiv 2022.04) Detector-Free Weakly Supervised Group Activity Recognition, [Paper]
(arXiv 2022.04) Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos, [Paper], [Project]
(arXiv 2022.04) What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions, [Paper]
(arXiv 2022.04) MaxViT: Multi-Axis Vision Transformer, [Paper]

2022.03

(arXiv 2022.03) Spatial-Temporal Parallel Transformer for Arm-Hand Dynamic Estimation, [Paper]
(arXiv 2022.03) ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval, [Paper]
(arXiv 2022.03) ReSTR: Convolution-free Referring Image Segmentation Using Transformers, [Paper], [Project]
(arXiv 2022.03) CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation, [Paper]
(arXiv 2022.03) Deformable Video Transformer, [Paper]
(arXiv 2022.03) End-to-End Trajectory Distribution Prediction Based on Occupancy Grid Maps, [Paper]
(arXiv 2022.03) CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow, [Paper], [Code]
(arXiv 2022.03) VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers, [Paper], [App]
(arXiv 2022.03) TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing, [Paper], [Code]
(arXiv 2022.03) BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers, [Paper], [Code]
(arXiv 2022.03) Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models, [Paper], [Code]
(arXiv 2022.03) Bringing Old Films Back to Life, [Paper], [Code]
(arXiv 2022.03) Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model, [Paper], [Code]
(arXiv 2022.03) SeqTR: A Simple yet Universal Network for Visual Grounding, [Paper], [Code]
(arXiv 2022.03) InstaFormer: Instance-Aware Image-to-Image Translation with Transformer, [Paper]
(arXiv 2022.03) Omni-DETR: Omni-Supervised Object Detection with Transformers, [Paper], [Code]
(arXiv 2022.03) Learning Program Representations for Food Images and Cooking Recipes, [Paper], [Project]
(arXiv 2022.03) ITTR: Unpaired Image-to-Image Translation with Transformers, [Paper]
(arXiv 2022.03) VPTR: Efficient Transformers for Video Prediction, [Paper], [Code]
(arXiv 2022.03) Parameter-efficient Fine-tuning for Vision Transformers, [Paper]
(arXiv 2022.03) TubeDETR: Spatio-Temporal Video Grounding with Transformers, [Paper], [Code]
(arXiv 2022.03) Exploring Plain Vision Transformer Backbones for Object Detection, [Paper]
(arXiv 2022.03) PROMPTDET: EXPAND YOUR DETECTOR VOCABULARY WITH UNCURATED IMAGES, [Paper], [Code]
(arXiv 2022.03) Few-Shot Object Detection with Fully Cross-Transformer, [Paper]
(arXiv 2022.03) Unified Transformer Tracker for Object Tracking, [Paper]
(arXiv 2022.03) X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, [Paper], [Code]
(arXiv 2022.03) Fine-tuning Image Transformers using Learnable Memory, [Paper]
(arXiv 2022.03) MAT: Mask-Aware Transformer for Large Hole Image Inpainting, [Paper], [Code]
(arXiv 2022.03) mc-BEiT: Multi-choice Discretization for Image BERT Pre-training, [Paper]
(arXiv 2022.03) End-to-End Transformer Based Model for Image Captioning, [Paper]
(arXiv 2022.03) Hybrid Routing Transformer for Zero-Shot Learning, [Paper]
(arXiv 2022.03) TREATMENT LEARNING TRANSFORMER FOR NOISY IMAGE CLASSIFICATION, [Paper]
(arXiv 2022.03) Do Vision-Language Pretrained Models Learn Primitive Concepts?, [Paper]
(arXiv 2022.03) Transformer Inertial Poser: Attention-based Real-time Human Motion Reconstruction from Sparse IMUs, [Paper]
(arXiv 2022.03) SepViT: Separable Vision Transformer, [Paper]
(arXiv 2022.03) MatteFormer: Transformer-Based Image Matting via Prior-Tokens, [Paper], [Code]
(arXiv 2022.03) Feature Selective Transformer for Semantic Image Segmentation, [Paper]
(arXiv 2022.03) Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos, [Paper], [Code]
(arXiv 2022.03) RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution, [Paper], [Code]
(arXiv 2022.03) Single-Stream Multi-Level Alignment for Vision-Language Pretraining, [Paper]
(arXiv 2022.03) Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Collaborative Transformers for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2022.03) Object Memory Transformer for Object Goal Navigation, [Paper]
(arXiv 2022.03) Brain-inspired Multilayer Perceptron with Spiking Neurons, [Paper], [Code]
(arXiv 2022.03) HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network, [Paper], [Code]
(arXiv 2022.03) REGTR: End-to-end Point Cloud Correspondences with Transformers, [Paper], [Code]
(arXiv 2022.03) Automated Progressive Learning for Efficient Training of Vision Transformers, [Paper]
(arXiv 2022.03) Stratified Transformer for 3D Point Cloud Segmentation, [Paper], [Code]
(arXiv 2022.03) NOC-REK: Novel Object Captioning with Retrieved Vocabulary from External Knowledge, [Paper]
(arXiv 2022.03) FACIAL EXPRESSION RECOGNITION WITH SWIN TRANSFORMER, [Paper]
(arXiv 2022.03) Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness, [Paper]
(arXiv 2022.03) Efficient Visual Tracking via Hierarchical Cross-Attention Transformer, [Paper], [Code]
(arXiv 2022.03) High-Performance Transformer Tracking, [Paper], [Code]
(arXiv 2022.03) RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers, [Paper]
(arXiv 2022.03) Multi-modal Multi-label Facial Action Unit Detection with Transformer, [Paper]
(arXiv 2022.03) MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection, [Paper], [Code]
(arXiv 2022.03) Text to Mesh Without 3D Supervision Using Limit Subdivision, [Paper], [Project]
(arXiv 2022.03) GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection, [Paper], [Code]
(arXiv 2022.03) CrossFormer: Cross Spatio-Temporal Transformer for 3D Human Pose Estimation, [Paper]
(arXiv 2022.03) FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, [Paper], [Code]
(arXiv 2022.03) Vision Transformer Compression with Structured Pruning and Low Rank Approximation, [Paper]
(arXiv 2022.03) Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, [Paper]
(arXiv 2022.03) MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection, [Paper]
(arXiv 2022.03) Learning Patch-to-Cluster Attention in Vision Transformer, [Paper]
(arXiv 2022.03) Visual Prompt Tuning, [Paper]
(arXiv 2022.03) Training-free Transformer Architecture Search, [Paper]
(arXiv 2022.03) VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, [Paper], [Code]
(arXiv 2022.03) METAMORPH: LEARNING UNIVERSAL CONTROLLERS WITH TRANSFORMERS, [Paper], [Project]
(arXiv 2022.03) A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning, [Paper]
(arXiv 2022.03) Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers, [Paper], [Project]
(arXiv 2022.03) Associating Objects with Scalable Transformers for Video Object Segmentation, [Paper], [[Project]](https://github.com/z-x-yang/AOT)
(arXiv 2022.03) HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation, [Paper], [Code]
(arXiv 2022.03) Learning to generate line drawings that convey geometry and semantics, [Paper], [Project]
(arXiv 2022.03) UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection, [Paper], [Code]
(arXiv 2022.03) AIMusicGuru: Music Assisted Human Pose Correction, [Paper]
(arXiv 2022.03) What to Hide from Your Students: Attention-Guided Masked Image Modeling, [Paper]
(arXiv 2022.03) Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer, [Paper]
(arXiv 2022.03) ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator, [Paper]
(arXiv 2022.03) Keypoints Tracking via Transformer Networks, [Paper], [Code]
(arXiv 2022.03) Beyond Fixation: Dynamic Window Visual Transformer, [Paper], [Code]
(arXiv 2022.03) Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, [Paper]
(arXiv 2022.03) Self-supervised Video-centralised Transformer for Video Face Clustering, [Paper]
(arXiv 2022.03) Towards Exemplar-Free Continual Learning in Vision Transformers: an Account of Attention, Functional and Weight Regularization, [Paper]
(arXiv 2022.03) Global Tracking Transformers, [Paper], [Code]
(arXiv 2022.03) Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer, [Paper], [Code]
(arXiv 2022.03) QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation, [Paper]
(arXiv 2022.03) Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos, [Paper], [Project]
(arXiv 2022.03) GradViT: Gradient Inversion of Vision Transformers, [Paper], [Code]
(arXiv 2022.03) Mask Usage Recognition using Vision Transformer with Transfer Learning and Data Augmentation, [Paper]
(arXiv 2022.03) Under the Hood of Transformer Networks for Trajectory Forecasting, [Paper]
(arXiv 2022.03) Open-Vocabulary DETR with Conditional Matching, [Paper]
(arXiv 2022.03) Meta-attention for ViT-backed Continual Learning, [Paper], [Code]
(arXiv 2022.03) CNNs and Transformers Perceive Hybrid Images Similar to Humans, [Paper], [Code]
(arXiv 2022.03) Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory, [Paper], [Code]
(arXiv 2022.03) Affective Feedback Synthesis Towards Multimodal Text and Image Data, [Paper]
(arXiv 2022.03) ViewFormer: NeRF-free Neural Rendering from Few Images Using Transformers, [Paper]
(arXiv 2022.03) CLIP on Wheels: Zero-Shot Object Navigation as Object Localization and Exploration, [Paper]
(arXiv 2022.03) Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds, [Paper], [Code]
(arXiv 2022.03) HIPA: Hierarchical Patch Transformer for Single Image Super Resolution, [Paper]
(arXiv 2022.03) DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition, [Paper], [Code]
(arXiv 2022.03) MixFormer: End-to-End Tracking with Iterative Mixed Attention, [Paper], [Code]
(arXiv 2022.03) PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark, [Paper], [Code]
(arXiv 2022.03) Relationformer: A Unified Framework for Image-to-Graph Generation, [Paper], [Code]
(arXiv 2022.03) CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning, [Paper], [Code]
(arXiv 2022.03) Hyperbolic Vision Transformers: Combining Improvements in Metric Learning, [Paper], [Code]
(arXiv 2022.03) MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer, [Paper], [Code]
(arXiv 2022.03) Transformer-based HTR for Historical Documents, [Paper]
(arXiv 2022.03) simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers, [Paper], [Code]
(arXiv 2022.03) End-to-End Human-Gaze-Target Detection with Transformers, [Paper]
(arXiv 2022.03) End-to-End Video Text Spotting with Transformer, [Paper], [Code]
(arXiv 2022.03) Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation, [Paper], [Code]
(arXiv 2022.03) V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer, [Paper]
(arXiv 2022.03) LocATe: End-to-end Localization of Actions in 3D with Transformers, [Paper]
(arXiv 2022.03) AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder, [Paper]
(arXiv 2022.03) ViM: Out-Of-Distribution with Virtual-logit Matching, [Paper], [Code]
(arXiv 2022.03) ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer, [Paper]
(arXiv 2022.03) Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows, [Paper]
(arXiv 2022.03) Vision Transformer with Convolutions Architecture Search, [Paper]
(arXiv 2022.03) Cascade Transformers for End-to-End Person Search, [Paper], [Code]
(arXiv 2022.03) CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance, [Paper]
(arXiv 2022.03) MatchFormer: Interleaving Attention in Transformers for Feature Matching, [Paper], [Code]
(arXiv 2022.03) Local-Global Context Aware Transformer for Language-Guided Video Segmentation, [Paper], [Code]
(arXiv 2022.03) Three things everyone should know about Vision Transformers, [Paper]
(arXiv 2022.03) Are Vision Transformers Robust to Spurious Correlations?, [Paper], [Code]
(arXiv 2022.03) MUTUAL GENERATIVE TRANSFORMER LEARNING FOR CROSS-VIEW GEO-LOCALIZATION, [Paper]
(arXiv 2022.03) DU-VLG: Unifying Vision-and-Language Generation via Dual Sequence-to-Sequence Pre-training, [Paper]
(arXiv 2022.03) Semantic-aligned Fusion Transformer for One-shot Object Detection, [Paper]
(arXiv 2022.03) UNIMO-2: End-to-End Unified Vision-Language Grounded Learning, [Paper], [Code]
(arXiv 2022.03) Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning, [Paper], [Code]
(arXiv 2022.03) One-Shot Adaptation of GAN in Just One CLIP, [Paper]
(arXiv 2022.03) PanoFormer: Panorama Transformer for Indoor 360° Depth Estimation, [Paper]
(arXiv 2022.03) PreTR: Spatio-Temporal Non-Autoregressive Trajectory Prediction Transformer, [Paper]
(arXiv 2022.03) Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image, [Paper], [Code]
(arXiv 2022.03) Transframer: Arbitrary Frame Prediction with Generative Models, [Paper]
(arXiv 2022.03) Towards Data-Efficient Detection Transformers, [Paper], [Code]
(arXiv 2022.03) Bi-directional Object-Context Prioritization Learning for Saliency Ranking, [Paper], [Code]
(arXiv 2022.03) PATCH-FOOL: ARE VISION TRANSFORMERS ALWAYS ROBUST AGAINST ADVERSARIAL PERTURBATIONS?, [Paper], [Code]
(arXiv 2022.03) WegFormer: Transformers for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Open Set Recognition using Vision Transformer with an Additional Detection Head, [Paper], [Code]
(arXiv 2022.03) UNIFIED VISUAL TRANSFORMER COMPRESSION, [Paper], [Code]
(arXiv 2022.03) Towards Practical Certifiable Patch Defense with Vision Transformer, [Paper]
(arXiv 2022.03) EDTER: Edge Detection with Transformer, [Paper], [Code]
(arXiv 2022.03) ActFormer: A GAN Transformer Framework towards General Action-Conditioned 3D Human Motion Generation, [Paper]
(arXiv 2022.03) Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution, [Paper]
(arXiv 2022.03) Revitalize Region Feature for Democratizing Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) Inverted Pyramid Multi-task Transformer for Dense Scene Understanding, [Paper]
(arXiv 2022.03) Smoothing Matters: Momentum Transformer for Domain Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Style Transformer for Image Inversion and Editing, [Paper], [Code]
(arXiv 2022.03) MotionCLIP: Exposing Human Motion Generation to CLIP Space, [Paper], [Project]
(arXiv 2022.03) The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy, [Paper], [Code]
(arXiv 2022.03) Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation, [Paper]
(arXiv 2022.03) Sparse Local Patch Transformer for Robust Face Alignment and Landmarks Inherent Relation Learning, [Paper], [Code]
(arXiv 2022.03) Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting, [Paper]
(arXiv 2022.03) DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection, [Paper]
(arXiv 2022.03) DATR: Domain-adaptive transformer for multi-domain landmark detection, [Paper]
(arXiv 2022.03) EventFormer: AU Event Transformer for Facial Action Unit Event Detection, [Paper]
(arXiv 2022.03) Accelerating DETR Convergence via Semantic-Aligned Matching, [Paper], [Code]
(arXiv 2022.03) All in One: Exploring Unified Video-Language Pre-training, [Paper], [Code]
(arXiv 2022.03) CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment, [Paper]
(arXiv 2022.03) EIT: Efficiently Lead Inductive Biases to ViT, [Paper], [Code]
(arXiv 2022.03) Self-Promoted Supervision for Few-Shot Transformer, [Paper], [Code]
(arXiv 2022.03) MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization, [Paper]
(arXiv 2022.03) Disentangled Representation Learning for Text-Video Retrieval, [Paper]
(arXiv 2022.03) TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding, [Paper], [Dataset]
(arXiv 2022.03) Visualizing and Understanding Patch Interactions in Vision Transformer, [Paper]
(arXiv 2022.03) ANTI-OVERSMOOTHING IN DEEP VISION TRANSFORMERS VIA THE FOURIER DOMAIN ANALYSIS: FROM THEORY TO PRACTICE, [Paper], [Code]
(arXiv 2022.03) Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision, [Paper], [Code]
(arXiv 2022.03) ActiveMLP: An MLP-like Architecture with Active Token Mixer, [Paper], [Code]
(arXiv 2022.03) Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding, [Paper]
(arXiv 2022.03) TrueType Transformer: Character and Font Style Recognition in Outline Format, [Paper]
(arXiv 2022.03) LOOPITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval, [Paper]
(arXiv 2022.03) MVP: Multimodality-guided Visual Pre-training, [Paper]
(arXiv 2022.03) DEER: Detection-agnostic End-to-End Recognizer for Scene Text Spotting, [Paper]
(arXiv 2022.03) Multi-Modal Mixup for Robust Fine-tuning, [Paper]
(arXiv 2022.03) AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant, [Paper], [Project]
(arXiv 2022.03) Coarse-to-Fine Vision Transformer, [Paper], [Code]
(arXiv 2022.03) Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers, [Paper]
(arXiv 2022.03) WAVEMIX: RESOURCE-EFFICIENT TOKEN MIXING FOR IMAGES, [Paper]
(arXiv 2022.03) VOVIT: LOW LATENCY GRAPH-BASED AUDIO-VISUAL VOICE SEPARATION TRANSFORMER, [Paper], [Code]
(arXiv 2022.03) Graph Attention Transformer Network for Multi-Label Image Classification, [Paper]
(arXiv 2022.03) EDGEFORMER: IMPROVING LIGHT-WEIGHT CONVNETS BY LEARNING FROM VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.03) Skating-Mixer: Multimodal MLP for Scoring Figure Skating, [Paper]
(arXiv 2022.03) Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention, [Paper]
(arXiv 2022.03) CP-ViT: Cascade Vision Transformer Pruning via Progressive Sparsity Prediction, [Paper]
(arXiv 2022.03) Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning, [Paper]
(arXiv 2022.03) ChiTransformer: Towards Reliable Stereo from Cues, [Paper]
(arXiv 2022.03) A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection, [Paper], [Code]
(arXiv 2022.03) Coarse-to-Fine Sparse Transformer for Hyperspectral Image Reconstruction, [Paper]
(arXiv 2022.03) CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) Multiscale Transformer for Hyperspectral Image Classification, [Paper]
(arXiv 2022.03) Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning, [Paper], [Code]
(arXiv 2022.03) Autoregressive Image Generation using Residual Quantization, [Paper]
(arXiv 2022.03) CONTEXTFORMER: A TRANSFORMER WITH SPATIO-CHANNEL ATTENTION FOR CONTEXT MODELING IN LEARNED IMAGE COMPRESSION, [Paper]
(arXiv 2022.03) Patch Similarity Aware Data-Free Quantization for Vision Transformers, [Paper]
(arXiv 2022.03) ViT-P: Rethinking Data-efficient Vision Transformers from Locality, [Paper]
(arXiv 2022.03) DIT: SELF-SUPERVISED PRE-TRAINING FOR DOCUMENT IMAGE TRANSFORMER, [Paper]
(arXiv 2022.03) Towards Efficient and Scalable Sharpness-Aware Minimization, [Paper]
(arXiv 2022.03) HyperTransformer: A Textural and Spectral Feature Fusion Transformer for Pansharpening, [Paper], [Code]
(arXiv 2022.03) UVCGAN: UNET VISION TRANSFORMER CYCLE-CONSISTENT GAN FOR UNPAIRED IMAGE-TO-IMAGE TRANSLATION, [Paper], [Code]
(arXiv 2022.03) Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning, [Paper], [Code]
(arXiv 2022.03) PANFORMER: A TRANSFORMER BASED MODEL FOR PAN-SHARPENING, [Paper], [Code]
(arXiv 2022.03) Multi-class Token Transformer for Weakly Supervised Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Cross Language Image Matching for Weakly Supervised Semantic Segmentation, [Paper]
(arXiv 2022.03) Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers, [Paper], [Code]
(arXiv 2022.03) DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, [Paper], [Code]
(arXiv 2022.03) MetaFormer: A Unified Meta Framework for Fine-Grained Recognition, [Paper], [Code]
(arXiv 2022.03) Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language, [Paper]
(arXiv 2022.03) Knowledge Amalgamation for Object Detection with Transformers, [Paper]
(arXiv 2022.03) Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos, [Paper]
(arXiv 2022.03) Modeling Coreference Relations in Visual Dialog, [Paper], [Code]
(arXiv 2022.03) VITRANSPAD: VIDEO TRANSFORMER USING CONVOLUTION AND SELF-ATTENTION FOR FACE PRESENTATION ATTACK DETECTION, [Paper]
(arXiv 2022.03) Multi-Tailed Vision Transformer for Efficient Inference, [Paper]
(arXiv 2022.03) Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation, [Paper], [Code]
(arXiv 2022.03) Ensembles of Vision Transformers as a New Paradigm for Automated Classification in Ecology, [Paper]
(arXiv 2022.03) LGT-Net: Indoor Panoramic Room Layout Estimation with Geometry-Aware Transformer Network, [Paper], [Code]
(arXiv 2022.03) LatentFormer: Multi-Agent Transformer-Based Interaction Modeling and Trajectory Prediction, [Paper]
(arXiv 2022.03) DCT-Former: Efficient Self-Attention with Discrete Cosine Transform, [Paper], [Code]
(arXiv 2022.03) Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment, [Paper]
(arXiv 2022.03) Spatiotemporal Transformer Attention Network for 3D Voxel Level Joint Segmentation and Motion Prediction in Point Cloud, [Paper]
(arXiv 2022.03) CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP, [Paper]
(arXiv 2022.03) MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video, [Paper]
(arXiv 2022.03) X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning, [Paper]
(arXiv 2022.03) 3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification, [Paper]
(arXiv 2022.03) DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation, [Paper]
(arXiv 2022.03) D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention, [Paper]
(arXiv 2022.03) Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding, [Paper], [Code]
(arXiv 2022.03) Self-supervised Transformer for Deepfake Detection, [Paper]
(arXiv 2022.03) Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions, [Paper]
(arXiv 2022.03) TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration, [Paper], [Code]
(arXiv 2022.03) DN-DETR: Accelerate DETR Training by Introducing Query DeNoising, [Paper], [Code]
(arXiv 2022.03) Protecting Celebrities with Identity Consistency Transformer, [Paper]
(arXiv 2022.03) Masked Visual Pre-training for Motor Control, [Paper], [Project]
(arXiv 2022.03) NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, [Paper], [Code]
(arXiv 2022.03) Conditional Prompt Learning for Vision-Language Models, [Paper], [Code]
(arXiv 2022.03) Lane Detection with Versatile AtrousFormer and Local Semantic Guidance, [Paper]
(arXiv 2022.03) Forecasting Characteristic 3D Poses of Human Actions, [Paper], [Code]

2022.02

(arXiv 2022.02) Bayesian Structure Learning with Generative Flow Networks, [Paper]
(arXiv 2022.02) Towards Unsupervised Domain Adaptation via Domain-Transformer, [Paper]
(arXiv 2022.02) An End-to-End Transformer Model for Crowd Localization, [Paper]
(arXiv 2022.02) Instantaneous Physiological Estimation using Video Transformers, [Paper], [Code]
(arXiv 2022.02) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Translation, [Paper], [Code]
(arXiv 2022.02) ATTENTION ENABLES ZERO APPROXIMATION ERROR, [Paper]
(arXiv 2022.02) When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection, [Paper], [Code]
(arXiv 2022.02) AUTO-SCALING VISION TRANSFORMERS WITHOUT TRAINING, [Paper], [Code]
(arXiv 2022.02) Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2022.02) LEARNING TO MERGE TOKENS IN VISION TRANSFORMERS, [Paper]
(arXiv 2022.02) ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers, [Paper], [Code]
(arXiv 2022.02) SELF-SUPERVISED TRANSFORMERS FOR UNSUPERVISED OBJECT DISCOVERY USING NORMALIZED CUT, [Paper], [Project]
(arXiv 2022.02) Paying U-Attention to Textures: Multi-Stage Hourglass Vision Transformer for Universal Texture Synthesis, [Paper]
(arXiv 2022.02) CaMEL: Mean Teacher Learning for Image Captioning, [Paper]
(arXiv 2022.02) Hierarchical Perceiver, [Paper]
(arXiv 2022.02) Movies2Scenes: Learning Scene Representations Using Movie Similarities, [Paper]
(arXiv 2022.02) GroupViT: Semantic Segmentation Emerges from Text Supervision, [Paper], [Code]
(arXiv 2022.02) Snowflake Point Deconvolution for Point Cloud Completion and Generation with Skip-Transformer, [Paper], [Code]
(arXiv 2022.02) Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations, [Paper]
(arXiv 2022.02) ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond, [Paper]
(arXiv 2022.02) PMP-Net++: Point Cloud Completion by Transformer-Enhanced Multi-step Point Moving Paths, [Paper], [Code]
(arXiv 2022.02) DataMUX: Data Multiplexing for Neural Networks, [Paper], [Code]
(arXiv 2022.02) On Guiding Visual Attention with Language Specification, [Paper]
(arXiv 2022.02) SPATIO-TEMPORAL OUTDOOR LIGHTING AGGREGATION ON IMAGE SEQUENCES USING TRANSFORMER NETWORKS, [Paper]
(arXiv 2022.02) MISINFORMATION DETECTION IN SOCIAL MEDIA VIDEO POSTS, [Paper]
(arXiv 2022.02) Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [Paper]
(arXiv 2022.02) NOT ALL PATCHES ARE WHAT YOU NEED: EXPEDITING VISION TRANSFORMERS VIA TOKEN REORGANIZATIONS, [Paper], [Code]
(arXiv 2022.02) ActionFormer: Localizing Moments of Actions with Transformers, [Paper], [Code]
(arXiv 2022.02) One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones, [Paper]
(arXiv 2022.02) XAI for Transformers: Better Explanations through Conservative Propagation, [Paper]
(arXiv 2022.02) MeshLeTemp: Leveraging the Learnable Vertex-Vertex Relationship to Generalize Human Pose and Mesh Reconstruction for In-the-Wild Scenes, [Paper]
(arXiv 2022.02) ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer, [Paper]
(arXiv 2022.02) Hyper-relationship Learning Network for Scene Graph Generation, [Paper]
(arXiv 2022.02) CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval, [Paper]
(arXiv 2022.02) Flowformer: Linearizing Transformers with Conservation Flows, [Paper]
(arXiv 2022.02) DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following, [Paper], [Code]
(arXiv 2022.02) CATs++: Boosting Cost Aggregation with Convolutions and Transformers, [Paper]
(arXiv 2022.02) Geometric Transformer for Fast and Robust Point Cloud Registration, [Paper], [Code]
(arXiv 2022.02) I-Tuning: Tuning Language Models with Image for Caption Generation, [Paper]
(arXiv 2022.02) Multi-direction and Multi-scale Pyramid in Transformer for Video-based Pedestrian Retrieval, [Paper], [Code]
(arXiv 2022.02) Visual Acoustic Matching, [Paper]
(arXiv 2022.02) LighTN: Light-weight Transformer Network for Performance-overhead Tradeoff in Point Cloud Downsampling, [Paper]
(arXiv 2022.02) BViT: Broad Attention based Vision Transformer, [Paper], [Code]
(arXiv 2022.02) Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation, [Paper]
(arXiv 2022.02) Domain Adaptation via Prompt Learning, [Paper]
(arXiv 2022.02) Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs, [Paper], [Code]
(arXiv 2022.02) Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework, [Paper], [Project]
(arXiv 2022.02) HOW DO VISION TRANSFORMERS WORK? [Paper], [Code]
(arXiv 2022.02) ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning, [Paper], [Code]
(arXiv 2022.02) CLIPasso: Semantically-Aware Object Sketching, [Paper], [Code]
(arXiv 2022.02) Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer, [Paper]
(arXiv 2022.02) DEEP SOCCER CAPTIONING WITH TRANSFORMER: DATASET, SEMANTICS-RELATED LOSSES, AND MULTI-LEVEL EVALUATION, [Paper], [Project]
(arXiv 2022.02) ENTROFORMER: A TRANSFORMER-BASED ENTROPY MODEL FOR LEARNED IMAGE COMPRESSION, [Paper], [Code]
(arXiv 2022.02) Image Difference Captioning with Pre-training and Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) MaskGIT: Masked Generative Image Transformer, [Paper]
(arXiv 2022.02) Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning, [Paper]
(arXiv 2022.02) Motion-Aware Transformer For Occluded Person Re-identification, [Paper]
(arXiv 2022.02) Conditional Motion In-betweening, [Paper], [Code]
(arXiv 2022.02) Memory-based gaze prediction in deep imitation learning for robot manipulation, [Paper]
(arXiv 2022.02) Spherical Transformer, [Paper]
(arXiv 2022.02) OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context, [Paper]
(arXiv 2022.02) The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning, [Paper], [Project]
(arXiv 2022.02) DALL-EVAL: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers, [Paper], [Code]
(arXiv 2022.02) Pre-Trained Language Models for Interactive Decision-Making, [Paper]
(arXiv 2022.02) TransFollower: Long-Sequence Car-Following Trajectory Prediction through Transformer, [Paper]
(arXiv 2022.02) The devil is in the labels: Semantic segmentation from sentences, [Paper]
(arXiv 2022.02) Webly Supervised Concept Expansion for General Purpose Vision Models, [Paper], [Project]
(arXiv 2022.02) VU-BERT: A UNIFIED FRAMEWORK FOR VISUAL DIALOG, [Paper]
(arXiv 2022.02) UNIFYING ARCHITECTURES, TASKS, AND MODALITIES THROUGH A SIMPLE SEQUENCE-TO-SEQUENCE LEARNING FRAMEWORK, [Paper], [Code]
(arXiv 2022.02) Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics, [Paper]
(arXiv 2022.02) TRANSDREAMER: REINFORCEMENT LEARNING WITH TRANSFORMER WORLD MODELS, [Paper]
(arXiv 2022.02) Vision-Language Pre-Training with Triple Contrastive Learning, [Paper], [Code]
(arXiv 2022.02) Corrupted Image Modeling for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2022.02) BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, [Paper], [Code]
(arXiv 2022.02) DNNFuser: Generative Pre-Trained Transformer as a Generalized Mapper for Layer Fusion in DNN Accelerators, [Paper]
(arXiv 2022.02) Interactron: Embodied Adaptive Object Detection, [Paper]
(arXiv 2022.02) Local Feature Matching with Transformers for low-end devices LoFTR method adaptation approach, [Paper], [Code]
(arXiv 2022.02) Can Transformers be Strong Treatment Effect Estimators?, [Paper]
(arXiv 2022.02) Improving Sample Efficiency of Value Based Models Using Attention and Vision Transformers, [Paper]
(arXiv 2022.02) Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics, [Paper], [Code]

2022.01

(arXiv 2022.01) O-ViT: Orthogonal Vision Transformer, [Paper]
(arXiv 2022.01) DynaMixer: A Vision MLP Architecture with Dynamic Mixing, [Paper]
(arXiv 2022.01) VRT: A Video Restoration Transformer, [Paper], [Code]
(arXiv 2022.01) DAB-DETR: DYNAMIC ANCHOR BOXES ARE BETTER QUERIES FOR DETR, [Paper], [Code]
(arXiv 2022.01) Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations, [Paper]
(arXiv 2022.01) MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment, [Paper]
(arXiv 2022.01) VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training, [Paper]
(arXiv 2022.01) BOAT: Bilateral Local Attention Vision Transformer, [Paper]
(arXiv 2022.01) GRAPH SELF-ATTENTION FOR LEARNING GRAPH REPRESENTATION WITH TRANSFORMER, [Paper]
(arXiv 2022.01) Aggregating Global Features into Local Vision Transformer, [Paper], [Code]
(arXiv 2022.01) Transformer Module Networks for Systematic Generalization in Visual Question Answering, [Paper]
(arXiv 2022.01) Generalised Image Outpainting with U-Transformer, [Paper]
(arXiv 2022.01) RelTR: Relation Transformer for Scene Graph Generation, [Paper]
(arXiv 2022.01) DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer, [Paper]
(arXiv 2022.01) Pre-Trained Language Transformers are Universal Image Classifiers, [Paper]
(arXiv 2022.01) Explore and Match: End-to-End Video Grounding with Transformer, [Paper]
(arXiv 2022.01) TGFuse: An Infrared and Visible Image Fusion Approach Based on Transformer and Generative Adversarial Network, [Paper]
(arXiv 2022.01) ViT-HGR: Vision Transformer-based Hand Gesture Recognition from High Density Surface EMG Signals, [Paper]
(arXiv 2022.01) ShapeFormer: Transformer-based Shape Completion via Sparse Representation, [Paper], [Project]
(arXiv 2022.01) CONVOLUTIONAL XFORMERS FOR VISION, [Paper], [Code]
(arXiv 2022.01) DocEnTr: An End-to-End Document Image Enhancement Transformer, [Paper], [Code]
(arXiv 2022.01) Zero-Shot Sketch Based Image Retrieval using Graph Transformer, [Paper]
(arXiv 2022.01) SA-VQA: Structured Alignment of Visual and Semantic Representations for Visual Question Answering, [Paper]
(arXiv 2022.01) DUAL-TASKS SIAMESE TRANSFORMER FRAMEWORK FOR BUILDING DAMAGE ASSESSMENT, [Paper]
(arXiv 2022.01) When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism, [Paper], [Code]
(arXiv 2022.01) Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation, [Paper]
(arXiv 2022.01) Training Vision Transformers with Only 2040 Images, [Paper]
(arXiv 2022.01) Learning To Recognize Procedural Activities with Distant Supervision, [Paper]
(arXiv 2022.01) EVALUATING LANGUAGE-BIASED IMAGE CLASSIFICATION BASED ON SEMANTIC REPRESENTATIONS, [Paper]
(arXiv 2022.01) A Comprehensive Study of Vision Transformers on Dense Prediction Tasks, [Paper]
(arXiv 2022.01) UniFormer: Unifying Convolution and Self-attention for Visual Recognition, [Paper], [Code]
(arXiv 2022.01) Patches Are All You Need? [Paper], [Code]
(arXiv 2022.01) Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval, [Paper]
(arXiv 2022.01) LEARNING TO ACT WITH AFFORDANCE-AWARE MULTIMODAL NEURAL SLAM, [Paper]
(arXiv 2022.01) Visual Information Guided Zero-Shot Paraphrase Generation, [Paper]
(arXiv 2022.01) TerViT: An Efficient Ternary Vision Transformer, [Paper]
(arXiv 2022.01) End-to-end Generative Pretraining for Multimodal Video Captioning, [Paper]
(arXiv 2022.01) OMNIVORE: A Single Model for Many Visual Modalities, [Paper], [Project]
(arXiv 2022.01) MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition, [Paper]
(arXiv 2022.01) The CLEAR Benchmark: Continual LEArning on Real-World Imagery, [Paper], [Project]
(arXiv 2022.01) ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues, [Paper]
(arXiv 2022.01) Cross-modal Contrastive Distillation for Instructional Activity Anticipation, [Paper]
(arXiv 2022.01) Transformers in Action: Weakly Supervised Action Segmentation, [Paper]
(arXiv 2022.01) VAQF: Fully Automatic Software-hardware Co-design Framework for Low-bit Vision Transformer, [Paper]
(arXiv 2022.01) CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks, [Paper]
(arXiv 2022.01) Domain Adaptation via Bidirectional Cross-Attention Transformer, [Paper]
(arXiv 2022.01) Continual Transformers: Redundancy-Free Attention for Online Inference, [Paper]
(arXiv 2022.01) Motion Inbetweening via Deep ∆-Interpolator, [Paper]
(arXiv 2022.01) RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training, [Paper]
(arXiv 2022.01) GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events, [Paper]
(arXiv 2022.01) TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning, [Paper]
(arXiv 2022.01) Q-ViT: Fully Differentiable Quantization for Vision Transformer, [Paper]
(arXiv 2022.01) Disentangled Latent Transformer for Interpretable Monocular Height Estimation, [Paper], [Project]
(arXiv 2022.01) Poseur: Direct Human Pose Regression with Transformers, [Paper]
(arXiv 2022.01) SWINUNET3D - A HIERARCHICAL ARCHITECTURE FOR DEEP TRAFFIC PREDICTION USING SHIFTED WINDOW TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) SWIN-POSE: SWIN TRANSFORMER BASED HUMAN POSE ESTIMATION, [Paper]
(arXiv 2022.01) Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, [Paper], [Project]
(arXiv 2022.01) ViT2Hash: Unsupervised Information-Preserving Hashing, [Paper]
(arXiv 2022.01) LANGUAGE-DRIVEN SEMANTIC SEGMENTATION, [Paper], [Code]
(arXiv 2022.01) Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond, [Paper], [Code]
(arXiv 2022.01) ImageSubject: A Large-scale Dataset for Subject Detection, [Paper]
(arXiv 2022.01) Detecting Twenty-thousand Classes using Image-level Supervision, [Paper], [Code]
(arXiv 2022.01) Generalized Category Discovery, [Paper], [Code]
(arXiv 2022.01) Video Summarization Based on Video-text Modelling, [Paper]
(arXiv 2022.01) Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2022.01) QUADTREE ATTENTION FOR VISION TRANSFORMERS, [Paper], [Code]
(arXiv 2022.01) A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval, [Paper], [Project]
(arXiv 2022.01) MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound, [Paper], [Project]
(arXiv 2022.01) On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering, [Paper]
(arXiv 2022.01) Pyramid Fusion Transformer for Semantic Segmentation, [Paper]
(arXiv 2022.01) Multiview Transformers for Video Recognition, [Paper]
(arXiv 2022.01) HYPERTRANSFORMER: MODEL GENERATION FOR SUPERVISED AND SEMI-SUPERVISED FEW-SHOT LEARNING, [Paper]
(arXiv 2022.01) UNIFORMER: UNIFIED TRANSFORMER FOR EFFICIENT SPATIOTEMPORAL REPRESENTATION LEARNING, [Paper], [Code]
(arXiv 2022.01) BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions, [Paper], [Project]
(arXiv 2022.01) TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers, [Paper]
(arXiv 2022.01) CLIP-Event: Connecting Text and Images with Event Structures, [Paper], [Code]
(arXiv 2022.01) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training, [Paper]
(arXiv 2022.01) Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention, [Paper], [Code]
(arXiv 2022.01) Self-Training Vision Language BERTs with a Unified Conditional Model, [Paper]
(arXiv 2022.01) TransVPR: Transformer-based place recognition with multi-level attention aggregation, [Paper]
(arXiv 2022.01) Compact Bidirectional Transformer for Image Captioning, [Paper], [Code]
(arXiv 2022.01) Flow-Guided Sparse Transformer for Video Deblurring, [Paper]
(arXiv 2022.01) Stochastic Layers in Vision Transformers, [Paper]
(arXiv 2022.01) ERNIE-VILG: UNIFIED GENERATIVE PRE-TRAINING FOR BIDIRECTIONAL VISION-LANGUAGE GENERATION, [Paper]
(arXiv 2022.01) InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer, [Paper], [Code]
(arXiv 2022.01) CSformer: Bridging Convolution and Transformer for Compressive Sensing, [Paper]
(arXiv 2022.01) Persformer: A Transformer Architecture for Topological Machine Learning, [Paper]
(arXiv 2022.01) Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space, [Paper]
(arXiv 2022.01) Language as Queries for Referring Video Object Segmentation, [Paper], [Code]
(arXiv 2022.01) PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture, [Paper], [Code]
(arXiv 2022.01) A TRANSFORMER-BASED SIAMESE NETWORK FOR CHANGE DETECTION, [Paper], [Code]
(arXiv 2022.01) Vision Transformer with Deformable Attention, [Paper], [Code]
(arXiv 2022.01) Splicing ViT Features for Semantic Appearance Transfer, [Paper], [Project]
(arXiv 2022.01) Detail-Preserving Transformer for Light Field Image Super-Resolution, [Paper], [Code]

2021.12

(arXiv 2021.12) Multi-Dimensional Model Compression of Vision Transformer, [Paper]
(arXiv 2021.12) Siamese Network with Interactive Transformer for Video Object Segmentation, [Paper], [Code]
(arXiv 2021.12) Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention, [Paper], [Code]
(arXiv 2021.12) APRIL: Finding the Achilles’ Heel on Privacy for Vision Transformers, [Paper]
(arXiv 2021.12) Synchronized Audio-Visual Frames with Fractional Positional Encoding for Transformers in Video-to-Text Translation, [Paper]
(arXiv 2021.12) Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?, [Paper]
(arXiv 2021.12) SPViT: Enabling Faster Vision Transformers via Soft Token Pruning, [Paper]
(arXiv 2021.12) A FISTFUL OF WORDS: LEARNING TRANSFERABLE VISUAL MODELS FROM BAG-OF-WORDS SUPERVISION, [Paper]
(arXiv 2021.12) StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2, [Paper], [Code]
(arXiv 2021.12) A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model, [Paper], [Code]
(arXiv 2021.12) Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence, [Paper]
(arXiv 2021.12) SIMVIT: EXPLORING A SIMPLE VISION TRANSFORMER WITH SLIDING WINDOWS, [Paper], [Code]
(arXiv 2021.12) SGTR: End-to-end Scene Graph Generation with Transformer, [Paper]
(arXiv 2021.12) Video Joint Modelling Based on Hierarchical Transformer for Co-summarization, [Paper]
(arXiv 2021.12) Vision Transformer for Small-Size Datasets, [Paper]
(arXiv 2021.12) Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction, [Paper]
(arXiv 2021.12) ViR: the Vision Reservoir, [Paper]
(arXiv 2021.12) SeMask: Semantically Masked Transformers for Semantic Segmentation, [Paper], [Code]
(arXiv 2021.12) Open-Vocabulary Image Segmentation, [Paper]
(arXiv 2021.12) ELSA: Enhanced Local Self-Attention for Vision Transformer, [Paper], [Code]
(arXiv 2021.12) LaTr: Layout-Aware Transformer for Scene-Text VQA, [Paper]
(arXiv 2021.12) Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding, [Paper]
(arXiv 2021.12) Fine-grained Multi-Modal Self-Supervised Learning, [Paper]
(arXiv 2021.12) SLIP: Self-supervision meets Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes, [Paper]
(arXiv 2021.12) MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input Adaptation, [Paper]
(arXiv 2021.12) iSegFormer: Interactive Image Segmentation with Transformers, [Paper], [Code]
(arXiv 2021.12) Contrastive Object Detection Using Knowledge Graph Embeddings, [Paper]
(arXiv 2021.12) RepMLPNet: Hierarchical Vision MLP with Re-parameterized Locality, [Paper], [Code]
(arXiv 2021.12) Lite Vision Transformer with Enhanced Self-Attention, [Paper], [Code]
(arXiv 2021.12) MPViT: Multi-Path Vision Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.12) SOIT: Segmenting Objects with Instance-Aware Transformers, [Paper], [Code]
(arXiv 2021.12) Learned Queries for Efficient Local Attention, [Paper], [Code]
(arXiv 2021.12) On Efficient Transformer and Image Pre-training for Low-level Vision, [Paper], [Code]
(arXiv 2021.12) LOCFORMER: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach, [Paper]
(arXiv 2021.12) Tell me what you see: A zero-shot action recognition method based on natural language descriptions, [Paper], [Code]
(arXiv 2021.12) Pre-Training Transformers for Domain Adaptation, [Paper]
(arXiv 2021.12) ScanQA: 3D Question Answering for Spatial Scene Understanding, [Paper]
(arXiv 2021.12) Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [Paper]
(arXiv 2021.12) StyleSwin: Transformer-based GAN for High-resolution Image Generation, [Paper], [Code]
(arXiv 2021.12) Mask2Former for Video Instance Segmentation, [Paper], [Code]
(arXiv 2021.12) GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, [Paper], [Code]
(arXiv 2021.12) Efficient Visual Tracking with Exemplar Transformers, [Paper], [Code]
(arXiv 2021.12) Neuromorphic Camera Denoising using Graph Neural Network-driven Transformers, [Paper]
(arXiv 2021.12) Align and Prompt: Video-and-Language Pre-training with Entity Prompts, [Paper], [Code]
(arXiv 2021.12) DATA EFFICIENT LANGUAGE-SUPERVISED ZERO-SHOT RECOGNITION WITH OPTIMAL TRANSPORT DISTILLATION, [Paper]
(arXiv 2021.12) SiamTrans: Zero-Shot Multi-Frame Image Restoration with Pre-Trained Siamese Transformers, [Paper]
(arXiv 2021.12) Full Transformer Framework for Robust Point Cloud Registration with Deep Information Interaction, [Paper], [Code]
(arXiv 2021.12) ZeroVL: A Strong Baseline for Aligning Vision-Language Representations with Limited Resources, [Paper]
(arXiv 2021.12) Towards End-to-End Image Compression and Analysis with Transformers, [Paper]
(arXiv 2021.12) How to augment your ViTs? Consistency loss and StyleAug, a random style transfer augmentation, [Paper]
(arXiv 2021.12) Learning to Prompt for Continual Learning, [Paper], [Code]
(arXiv 2021.12) Distilled Dual-Encoder Model for Vision-Language Understanding, [Paper], [Code]
(arXiv 2021.12) Dense Video Captioning Using Unsupervised Semantic Information, [Paper], [Code]
(arXiv 2021.12) Looking Outside the Box to Ground Language in 3D Scenes, [Paper], [Code]
(arXiv 2021.12) RegionCLIP: Region-based Language-Image Pretraining, [Paper], [Code]
(arXiv 2021.12) DProST: 6-DoF Object Pose Estimation Using Space Carving and Dynamic Projective Spatial Transformer, [Paper]
(arXiv 2021.12) Masked Feature Prediction for Self-Supervised Visual Pre-Training, [Paper]
(arXiv 2021.12) SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning, [Paper]
(arXiv 2021.12) TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos, [Paper], [Code]
(arXiv 2021.12) Co-training Transformer with Videos and Images Improves Action Recognition, [Paper]
(arXiv 2021.12) QAHOI: Query-Based Anchors for Human-Object Interaction Detection, [Paper], [Code]
(arXiv 2021.12) AdaViT: Adaptive Tokens for Efficient Vision Transformer, [Paper]
(arXiv 2021.12) CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations, [Paper]
(arXiv 2021.12) Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, [Paper]
(arXiv 2021.12) Deep ViT Features as Dense Visual Descriptors, [Paper], [Project]
(arXiv 2021.12) Geometry-Contrastive Transformer for Generalized 3D Pose Transfer, [Paper], [Code]
(arXiv 2021.12) Temporal Transformer Networks with Self-Supervision for Action Recognition, [Paper]
(arXiv 2021.12) COMPOSER: Compositional Learning of Group Activity in Videos, [Paper]
(arXiv 2021.12) Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition, [Paper]
(arXiv 2021.12) Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection, [Paper]
(arXiv 2021.12) SVIP: Sequence VerIfication for Procedures in Videos, [Paper]
(arXiv 2021.12) Improving Vision Transformers for Incremental Learning, [Paper]
(arXiv 2021.12) VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks, [Paper], [Code]
(arXiv 2021.12) Embracing Single Stride 3D Object Detector with Sparse Transformer, [Paper], [Code]
(arXiv 2021.12) PartGlot: Learning Shape Part Segmentation from Language Reference Games, [Paper]
(arXiv 2021.12) Pedestrian Trajectory Prediction via Spatial Interaction Transformer Network, [Paper]
(arXiv 2021.12) LEARNING SEMANTIC-ALIGNED FEATURE REPRESENTATION FOR TEXT-BASED PERSON SEARCH, [Paper]
(arXiv 2021.12) L-Verse: Bidirectional Generation Between Image and Text, [Paper]
(arXiv 2021.12) SELF-ATTENTION DOES NOT NEED O(n^2) MEMORY, [Paper]
(arXiv 2021.12) Are Vision Transformers Robust to Patch Perturbations? [Paper]
(arXiv 2021.12) Mesa: A Memory-saving Training Framework for Transformers, [Paper], [Code]
(arXiv 2021.12) Injecting Semantic Concepts into End-to-End Image Captioning, [Paper]
(arXiv 2021.12) MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning, [Paper]
(arXiv 2021.12) LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization, [Paper]
(arXiv 2021.12) FaceFormer: Speech-Driven 3D Facial Animation with Transformers, [Paper]
(arXiv 2021.12) Rethinking the Two-Stage Framework for Grounded Situation Recognition, [Paper], [Code]
(arXiv 2021.12) CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions, [Paper]
(arXiv 2021.12) Couplformer: Rethinking Vision Transformer with Coupling Attention Map, [Paper]
(arXiv 2021.12) Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation, [Paper]
(arXiv 2021.12) Visual Transformers with Primal Object Queries for Multi-Label Image Classification, [Paper]
(arXiv 2021.12) Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training, [Paper], [Code]
(arXiv 2021.12) MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection, [Paper]
(arXiv 2021.12) Grounded Language-Image Pre-training, [Paper], [Code]
(arXiv 2021.12) U^2-Former: A Nested U-shaped Transformer for Image Restoration, [Paper]
(arXiv 2021.12) ADAPTIVE CHANNEL ENCODING TRANSFORMER FOR POINT CLOUD ANALYSIS, [Paper]
(arXiv 2021.12) Pose-guided Feature Disentangling for Occluded Person Re-identification Based on Transformer, [Paper], [Code]
(arXiv 2021.12) VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts, [Paper]
(arXiv 2021.12) PointCLIP: Point Cloud Understanding by CLIP, [Paper], [Code]
(arXiv 2021.12) Learning Tracking Representations via Dual-Branch Fully Transformer Networks, [Paper], [Code]
(arXiv 2021.12) DYNAMIC TOKEN NORMALIZATION IMPROVES VISION TRANSFORMER, [Paper], [Code]
(arXiv 2021.12) PTTR: Relational 3D Point Cloud Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.12) GETAM: Gradient-weighted Element-wise Transformer Attention Map for Weakly-supervised Semantic segmentation, [Paper]
(arXiv 2021.12) Text2Mesh: Text-Driven Neural Stylization for Meshes, [Paper], [Project]
(arXiv 2021.12) LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences, [Paper]
(arXiv 2021.12) Make A Long Image Short: Adaptive Token Length for Vision Transformers, [Paper]
(arXiv 2021.12) FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization, [Paper], [Code]
(arXiv 2021.12) TransZero: Attribute-guided Transformer for Zero-Shot Learning, [Paper], [Code]
(arXiv 2021.12) Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer, [Paper], [Code]
(arXiv 2021.12) Hformer: Hybrid CNN-Transformer for Fringe Order Prediction in Phase Unwrapping of Fringe Projection, [Paper]
(arXiv 2021.12) Pre-training and Fine-tuning Transformers for fMRI Prediction Tasks, [Paper]
(arXiv 2021.12) Transformer based trajectory prediction, [Paper]
(arXiv 2021.12) Evaluating Transformers for Lightweight Action Recognition, [Paper]
(arXiv 2021.12) Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision, [Paper]
(arXiv 2021.12) CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification, [Paper]
(arXiv 2021.12) Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training, [Paper]
(arXiv 2021.12) Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal, [Paper], [Code]
(arXiv 2021.12) DoodleFormer: Creative Sketch Drawing with Transformers, [Paper]
(arXiv 2021.12) Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning, [Paper]
(arXiv 2021.12) AUDIO-VISUAL SYNCHRONISATION IN THE WILD, [Paper], [Project]
(arXiv 2021.12) Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs, [Paper]
(arXiv 2021.12) Garment4D: Garment Reconstruction from Point Cloud Sequences, [Paper], [Code]
(arXiv 2021.12) Locally Shifted Attention With Early Global Integration, [Paper], [Code]
(arXiv 2021.12) BLT: Bidirectional Layout Transformer for Controllable Layout Generation, [Paper]
(arXiv 2021.12) PE-former: Pose Estimation Transformer, [Paper], [Project]
(arXiv 2021.12) HairCLIP: Design Your Hair by Text and Reference Image, [Paper], [Project]
(arXiv 2021.12) CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, [Paper], [Code]
(arXiv 2021.12) A Bilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer, [Paper], [Code], [Dataset]
(arXiv 2021.12) DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition, [Paper], [Code]
(arXiv 2021.12) Recurrent Glimpse-based Decoder for Detection with Transformer, [Paper], [Code]
(arXiv 2021.12) Fast Point Transformer, [Paper]
(arXiv 2021.12) Assistive Tele-op: Leveraging Transformers to Collect Robotic Task Demonstrations, [Paper], [Project]
(arXiv 2021.12) Cross-Modality Fusion Transformer for Multispectral Object Detection, [Paper]
(arXiv 2021.12) PatchFormer: An Efficient Point Transformer with Patch Attention, [Paper]
(arXiv 2021.12) Transformer-Based Approach for Joint Handwriting and Named Entity Recognition in Historical documents, [Paper]
(arXiv 2021.12) MLP Architectures for Vision-and-Language Modeling: An Empirical Study, [Paper], [Code]
(arXiv 2021.12) Everything at Once – Multi-modal Fusion Transformer for Video Retrieval, [Paper]
(arXiv 2021.12) Prompting Visual-Language Models for Efficient Video Understanding, [Paper], [Project]
(arXiv 2021.12) FLAVA: A Foundational Language And Vision Alignment Model, [Paper]
(arXiv 2021.12) Embedding Arithmetic for Text-driven Image Transformation, [Paper]
(arXiv 2021.12) LAVT: Language-Aware Vision Transformer for Referring Image Segmentation, [Paper]
(arXiv 2021.12) Look at What I’m Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos, [Paper], [Project]
(arXiv 2021.12) Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, [Paper]
(arXiv 2021.12) DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting, [Paper], [Code]
(arXiv 2021.12) Self-supervised Video Transformer, [Paper], [Code]
(arXiv 2021.12) OW-DETR: Open-world Detection Transformer, [Paper]
(arXiv 2021.12) Zero-Shot Text-Guided Object Generation with Dream Fields, [Paper], [Project]
(arXiv 2021.12) Video-Text Pre-training with Learned Regions, [Paper], [Code]
(arXiv 2021.12) MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection, [Paper]
(arXiv 2021.12) TCTN: A 3D-Temporal Convolutional Transformer Network for Spatiotemporal Predictive Learning, [Paper]
(arXiv 2021.12) DenseCLIP: Extract Free Dense Labels from CLIP, [Paper]
(arXiv 2021.12) TransMEF: A Transformer-Based Multi-Exposure Image Fusion Framework using Self-Supervised Multi-Task Learning, [Paper]
(arXiv 2021.12) SwinTrack: A Simple and Strong Baseline for Transformer Tracking, [Paper], [Code]
(arXiv 2021.12) Object-Centric Unsupervised Image Captioning, [Paper]
(arXiv 2021.12) Vision Pair Learning: An Efficient Training Framework for Image Classification, [Paper]
(arXiv 2021.12) Visual-Semantic Transformer for Scene Text Recognition, [Paper]
(arXiv 2021.12) Differentiable Spatial Planning using Transformers, [Paper], [Project]
(arXiv 2021.12) Improved Multiscale Vision Transformers for Classification and Detection, [Paper]
(arXiv 2021.12) Masked-attention Mask Transformer for Universal Image Segmentation, [Paper], [Code]
(arXiv 2021.12) BEVT: BERT Pretraining of Video Transformers, [Paper]
(arXiv 2021.12) Human-Object Interaction Detection via Weak Supervision, [Paper]
(arXiv 2021.12) Learning Transformer Features for Image Quality Assessment, [Paper]
(arXiv 2021.12) CLIPstyler: Image Style Transfer with a Single Text Condition, [Paper]
(arXiv 2021.12) Multi-View Stereo with Transformer, [Paper]
(arXiv 2021.12) VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion, [Paper], [Code]
(arXiv 2021.12) Object-aware Video-language Pre-training for Retrieval, [Paper], [Code]

2021.11

(arXiv 2021.11) Multi-modal Transformers Excel at Class-agnostic Object Detection, [Paper], [Code]
(arXiv 2021.11) Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model, [Paper]
(arXiv 2021.11) NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition, [Paper], [Code]
(arXiv 2021.11) PolyViT: Co-training Vision Transformers on Images, Videos and Audio, [Paper]
(arXiv 2021.11) SWAT: Spatial Structure Within and Among Tokens, [Paper]
(arXiv 2021.11) ADAPTIVE FOURIER NEURAL OPERATORS: EFFICIENT TOKEN MIXERS FOR TRANSFORMERS, [Paper]
(arXiv 2021.11) DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion, [Paper], [Code]
(arXiv 2021.11) DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, [Paper], [Code]
(arXiv 2021.11) Ice hockey player identification via transformers, [Paper]
(arXiv 2021.11) DBIA: Data-free Backdoor Injection Attack against Transformer Networks, [Paper], [Code]
(arXiv 2021.11) Sparse Fusion for Multimodal Transformers, [Paper]
(arXiv 2021.11) PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer, [Paper], [Code]
(arXiv 2021.11) Self-Supervised Pre-Training for Transformer-Based Person Re-Identification, [Paper], [Code]
(arXiv 2021.11) DISCRETE REPRESENTATIONS STRENGTHEN VISION TRANSFORMER ROBUSTNESS, [Paper]
(arXiv 2021.11) TRAVLR: Now You See It, Now You Don’t! Evaluating Cross-Modal Transfer of Visio-Linguistic Reasoning, [Paper]
(arXiv 2021.11) Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling, [Paper]
(arXiv 2021.11) Semi-Supervised Vision Transformers, [Paper]
(arXiv 2021.11) CpT: Convolutional Point Transformer for 3D Point Cloud Processing, [Paper]
(arXiv 2021.11) ZERO-SHOT CERTIFIED DEFENSE AGAINST ADVERSARIAL PATCHES WITH VISION TRANSFORMERS, [Paper]
(arXiv 2021.11) PointMixer: MLP-Mixer for Point Cloud Understanding, [Paper]
(arXiv 2021.11) MetaFormer is Actually What You Need for Vision, [Paper], [Code]
(arXiv 2021.11) Florence: A New Foundation Model for Computer Vision, [Paper]
(arXiv 2021.11) Benchmarking Detection Transfer Learning with Vision Transformers, [Paper]
(arXiv 2021.11) Learning to Compose Visual Relations, [Paper], [Project]
(arXiv 2021.11) REFERENCE-BASED MAGNETIC RESONANCE IMAGE RECONSTRUCTION USING TEXTURE TRANSFORMER, [Paper]
(arXiv 2021.11) Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval, [Paper]
(arXiv 2021.11) Swin Transformer V2: Scaling Up Capacity and Resolution, [Paper], [Code]
(arXiv 2021.11) SimMIM: A Simple Framework for Masked Image Modeling, [Paper], [Code]
(arXiv 2021.11) Restormer: Efficient Transformer for High-Resolution Image Restoration, [Paper], [Code]
(arXiv 2021.11) Simple but Effective: CLIP Embeddings for Embodied AI, [Paper]
(arXiv 2021.11) ClipCap: CLIP Prefix for Image Captioning, [Paper], [Code]
(arXiv 2021.11) TransMix: Attend to Mix for Vision Transformers, [Paper], [Code]
(arXiv 2021.11) TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance, [Paper]
(arXiv 2021.11) Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts, [Paper], [Code]
(arXiv 2021.11) Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning, [Paper], [Code]
(arXiv 2021.11) Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement, [Paper], [Code]
(arXiv 2021.11) Tracking People with 3D Representations, [Paper], [Code]
(arXiv 2021.11) LiT: Zero-Shot Transfer with Locked-image Text Tuning, [Paper]
(arXiv 2021.11) FILIP: FINE-GRAINED INTERACTIVE LANGUAGE-IMAGE PRE-TRAINING, [Paper]
(arXiv 2021.11) Graph Relation Transformer: Incorporating pairwise object features into the Transformer architecture, [Paper], [Code]
(arXiv 2021.11) Attention Approximates Sparse Distributed Memory, [Paper]
(arXiv 2021.11) SLICED RECURSIVE TRANSFORMER, [Paper], [Code]
(arXiv 2021.11) HYBRID BYOL-VIT: EFFICIENT APPROACH TO DEAL WITH SMALL DATASETS, [Paper]
(arXiv 2021.11) Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling, [Paper], [Code]
(arXiv 2021.11) Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers, [Paper]
(arXiv 2021.11) StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis, [Paper], [Code]
(arXiv 2021.11) Revisiting spatio-temporal layouts for compositional action recognition, [Paper], [Code]
(arXiv 2021.11) PatchGame: Learning to Signal Mid-level Patches in Referential Games, [Paper], [Code]
(arXiv 2021.11) CAN VISION TRANSFORMERS PERFORM CONVOLUTION? [Paper]
(arXiv 2021.11) Livestock Monitoring with Transformer, [Paper]
(arXiv 2021.11) With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, [Paper], [Code]
(arXiv 2021.11) IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning, [Paper], [Project]
(arXiv 2021.11) BoxeR: Box-Attention for 2D and 3D Transformers, [Paper]
(arXiv 2021.11) VLDeformer: Vision-Language Decomposed Transformer for Fast Cross-Modal Retrieval, [Paper]
(arXiv 2021.11) Multi-Person 3D Motion Prediction with Multi-Range Transformers, [Paper], [Code]
(arXiv 2021.11) Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations, [Paper], [Project]
(arXiv 2021.11) Global Interaction Modelling in Vision Transformer via Super Tokens, [Paper]
(arXiv 2021.11) ML-Decoder: Scalable and Versatile Classification Head, [Paper], [Code]
(arXiv 2021.11) Exploiting Both Domain-specific and Invariant Knowledge via a Win-win Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.11) SWINBERT: End-to-End Transformers with Sparse Attention for Video Captioning, [Paper]
(arXiv 2021.11) Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization, [Paper]
(arXiv 2021.11) Universal Captioner: Long-Tail Vision-and-Language Model Training through Content-Style Separation, [Paper]
(arXiv 2021.11) Sparse is Enough in Scaling Transformers, [Paper]
(arXiv 2021.11) An implementation of the “Guess who?” game using CLIP, [Paper], [Code]
(arXiv 2021.11) HEAT: Holistic Edge Attention Transformer for Structured Reconstruction, [Paper]
(arXiv 2021.11) A Unified Pruning Framework for Vision Transformers, [Paper]
(arXiv 2021.11) Pyramid Adversarial Training Improves ViT Performance, [Paper]
(arXiv 2021.11) AssistSR: Affordance-centric Question-driven Video Segment Retrieval, [Paper], [Code & Data]
(arXiv 2021.11) DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation, [Paper], [Code]
(arXiv 2021.11) , [Paper]
(arXiv 2021.11) AdaViT: Adaptive Vision Transformers for Efficient Image Recognition, [Paper]
(arXiv 2021.11) ATS: Adaptive Token Sampling For Efficient Vision Transformers, [Paper]
(arXiv 2021.11) CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning, [Paper]
(arXiv 2021.11) CRIS: CLIP-Driven Referring Image Segmentation, [Paper]
(arXiv 2021.11) Shunted Self-Attention via Multi-Scale Token Aggregation, [Paper], [Code]
(arXiv 2021.11) MC-SSL0.0: Towards Multi-Concept Self-Supervised Learning, [Paper]
(arXiv 2021.11) TransWeather: Transformer-based Restoration of Images Degraded by Adverse Weather Conditions, [Paper], [Code]
(arXiv 2021.11) Searching the Search Space of Vision Transformer, [Paper], [Code]
(arXiv 2021.11) TransMVSNet: Global Context-aware Multi-view Stereo Network with Transformers, [Paper], [Code]
(arXiv 2021.11) Recurrent Vision Transformer for Solving Visual Reasoning Problems, [Paper]
(arXiv 2021.11) Video Frame Interpolation Transformer, [Paper]
(arXiv 2021.11) FQ-ViT: Fully Quantized Vision Transformer without Retraining, [Paper], [Code]
(arXiv 2021.11) LAFITE: Towards Language-Free Training for Text-to-Image Generation, [Paper]
(arXiv 2021.11) SPARSE DETR: EFFICIENT END-TO-END OBJECT DETECTION WITH LEARNABLE SPARSITY, [Paper], [Code]
(arXiv 2021.11) End-to-End Referring Video Object Segmentation with Multimodal Transformers, [Paper], [Code]
(arXiv 2021.11) Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling, [Paper], [Code]
(arXiv 2021.11) Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic, [Paper], [Code]
(arXiv 2021.11) Blended Diffusion for Text-driven Editing of Natural Images, [Paper], [Code]
(arXiv 2021.11) Mask Transfiner for High-Quality Instance Segmentation, [Paper], [Code]
(arXiv 2021.11) MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.11) PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, [Paper], [Code]
(arXiv 2021.11) Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes, [Paper], [Code]
(arXiv 2021.11) Towards Tokenized Human Dynamics Representation, [Paper], [Code]
(arXiv 2021.11) Self-slimmed Vision Transformer, [Paper]
(arXiv 2021.11) VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, [Paper], [Code]
(arXiv 2021.11) A Lightweight Graph Transformer Network for Human Mesh Reconstruction from 2D Human Pose, [Paper]
(arXiv 2021.11) MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video, [Paper]
(arXiv 2021.11) Octree Transformer: Autoregressive 3D Shape Generation on Hierarchically Structured Sequences, [Paper]
(arXiv 2021.11) Hierarchical Modular Network for Video Captioning, [Paper]
(arXiv 2021.11) NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, [Paper], [Code]
(arXiv 2021.11) An Image Patch is a Wave: Phase-Aware Vision MLP, [Paper]
(arXiv 2021.11) PTQ4ViT: Post-Training Quantization Framework for Vision Transformers, [Paper]
(arXiv 2021.11) PU-Transformer: Point Cloud Upsampling Transformer, [Paper]
(arXiv 2021.11) Scaling Up Vision-Language Pre-training for Image Captioning, [Paper]
(arXiv 2021.11) Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing, [Paper], [Code]
(arXiv 2021.11) Efficient Video Transformers with Spatial-Temporal Token Selection, [Paper]
(arXiv 2021.11) RedCaps: Web-curated image-text data created by the people, for the people, [Paper], [Project]
(arXiv 2021.11) EMScore: Evaluating Video Captioning via Coarse-Grained and Fine-Grained Embedding Matching, [Paper], [Code]
(arXiv 2021.11) Compositional Transformers for Scene Generation, [Paper], [Code]
(arXiv 2021.11) Vis-TOP: Visual Transformer Overlay Processor, [Paper]
(arXiv 2021.11) Grounded Situation Recognition with Transformers, [Paper], [Code]
(arXiv 2021.11) Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints, [Paper]
(arXiv 2021.11) UFO: A UniFied TransfOrmer for Vision-Language Representation Learning, [Paper]
(arXiv 2021.11) Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, [Paper]
(arXiv 2021.11) Combined Scaling for Zero-shot Transfer Learning, [Paper]
(arXiv 2021.11) Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding, [Paper]
(arXiv 2021.11) IBOT: IMAGE BERT PRE-TRAINING WITH ONLINE TOKENIZER, [Paper], [Code]
(arXiv 2021.11) Masked Autoencoders Are Scalable Vision Learners, [Paper]
(arXiv 2021.11) Mask-guided Spectral-wise Transformer for Efficient Hyperspectral Image Reconstruction, [Paper]
(arXiv 2021.11) Are Transformers More Robust Than CNNs?, [Paper], [Code]
(arXiv 2021.11) CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval, [Paper]
(arXiv 2021.11) Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, [Paper]
(arXiv 2021.11) VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, [Paper], [Code]
(arXiv 2021.11) LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs, [Paper], [Project]
(arXiv 2021.11) An Empirical Study of Training End-to-End Vision-and-Language Transformers, [Paper], [Code]
(arXiv 2021.11) HRViT: Multi-Scale High-Resolution Vision Transformer, [Paper]

2021.10

(arXiv 2021.10) Visual Keyword Spotting with Attention, [Paper], [Project]
(arXiv 2021.10) Learning Co-segmentation by Segment Swapping for Retrieval and Discovery, [Paper], [Data & Code]
(arXiv 2021.10) Visual Spatio-Temporal Relation-Enhanced Network for Cross-Modal Text-Video Retrieval, [Paper], [Code]
(arXiv 2021.10) Dispensed Transformer Network for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.10) Scatterbrain: Unifying Sparse and Low-rank Attention Approximation, [Paper]
(arXiv 2021.10) 3D Object Tracking with Transformer, [Paper], [Code]
(arXiv 2021.10) Blending Anti-Aliasing into Vision Transformer, [Paper], [Code]
(arXiv 2021.10) UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model, [Paper], [Data & Code]
(arXiv 2021.10) SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation, [Paper]
(arXiv 2021.10) Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network, [Paper]
(arXiv 2021.10) History Aware Multimodal Transformer for Vision-and-Language Navigation, [Paper], [Project]
(arXiv 2021.10) TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation, [Paper]
(arXiv 2021.10) TNTC: TWO-STREAM NETWORK WITH TRANSFORMER-BASED COMPLEMENTARITY FOR GAIT-BASED EMOTION RECOGNITION, [Paper]
(arXiv 2021.10) Contextual Similarity Aggregation with Self-attention for Visual Re-ranking, [Paper], [Code]
(arXiv 2021.10) IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition, [Paper], [Code]
(arXiv 2021.10) IMAGE-BASED CLIP-GUIDED ESSENCE TRANSFER, [Paper], [Code]
(arXiv 2021.10) Sinkformers: Transformers with Doubly Stochastic Attention, [Paper]
(arXiv 2021.10) ILLITERATE DALL·E LEARNS TO COMPOSE, [Paper], [Project], [Code]
(arXiv 2021.10) Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering, [Paper]
(arXiv 2021.10) SOFT: Softmax-free Transformer with Linear Complexity, [Paper], [Code]
(arXiv 2021.10) Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation, [Paper]
(arXiv 2021.10) TRANSFORMER ACCELERATION WITH DYNAMIC SPARSE ATTENTION, [Paper]
(arXiv 2021.10) CLOOB: MODERN HOPFIELD NETWORKS WITH INFOLOOB OUTPERFORM CLIP, [Paper], [Code]
(arXiv 2021.10) Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization, [Paper]
(arXiv 2021.10) StructFormer: Learning Spatial Structure for Language-Guided Semantic Rearrangement of Novel Objects, [Paper], [Project]
(arXiv 2021.10) Gophormer: Ego-Graph Transformer for Node Classification, [Paper]
(arXiv 2021.10) STRANSGAN: AN EMPIRICAL STUDY ON TRANSFORMER IN GANS, [Paper], [Code]
(arXiv 2021.10) MVT: Multi-view Vision Transformer for 3D Object Recognition, [Paper]
(arXiv 2021.10) DocTr: Document Image Transformer for Geometric Unwarping and Illumination Correction, [Paper], [Code]
(arXiv 2021.10) WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP, [Paper], [Code]
(arXiv 2021.10) AFTer-UNet: Axial Fusion Transformer UNet for Medical Image Segmentation, [Paper]
(arXiv 2021.10) AniFormer: Data-driven 3D Animation with Transformer, [Paper], [Code]
(arXiv 2021.10) Few-Shot Temporal Action Localization with Query Adaptive Transformer, [Paper], [Code]
(arXiv 2021.10) 3D-ANAS v2: Grafting Transformer Module on Automatically Designed ConvNet for Hyperspectral Image Classification, [Paper], [Code]
(arXiv 2021.10) CMTR: Cross-modality Transformer for Visible-infrared Person Re-identification, [Paper]
(arXiv 2021.10) 3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers, [Paper], [Code]
(arXiv 2021.10) HRFormer: High-Resolution Transformer for Dense Prediction, [Paper], [Code]
(arXiv 2021.10) Leveraging MoCap Data for Human Mesh Recovery, [Paper]
(arXiv 2021.10) A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models, [Paper]
(arXiv 2021.10) ASFormer: Transformer for Action Segmentation, [Paper], [Code]
(arXiv 2021.10) Multimodal Dialogue Response Generation, [Paper]
(arXiv 2021.10) Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals, [Paper]
(arXiv 2021.10) COMPOSITIONAL ATTENTION: DISENTANGLING SEARCH AND RETRIEVAL, [Paper], [Code]
(arXiv 2021.10) Spatial-Temporal Transformer for 3D Point Cloud Sequences, [Paper]
(arXiv 2021.10) TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) Unifying Multimodal Transformer for Bi-directional Image and Text Generation, [Paper]
(arXiv 2021.10) Transformer with a Mixture of Gaussian Keys, [Paper]
(arXiv 2021.10) DIFFUSIONCLIP: TEXT-GUIDED IMAGE MANIPULATION USING DIFFUSION MODELS, [Paper]
(arXiv 2021.10) Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs, [Paper], [Code]
(arXiv 2021.10) RIPPLE ATTENTION FOR VISUAL PERCEPTION WITH SUB-QUADRATIC COMPLEXITY, [Paper]
(arXiv 2021.10) Certified Patch Robustness via Smoothed Vision Transformers, [Paper], [Code]
(arXiv 2021.10) CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation, [Paper]
(arXiv 2021.10) Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation, [Paper]
(arXiv 2021.10) SPARSE MOES MEET EFFICIENT ENSEMBLES, [Paper]
(arXiv 2021.10) Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? [Paper]
(arXiv 2021.10) SignBERT: Pre-Training of Hand-Model-Aware Representation for Sign Language Recognition, [Paper]
(arXiv 2021.10) Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning, [Paper]
(arXiv 2021.10) Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block, [Paper]
(arXiv 2021.10) SUPERVISION EXISTS EVERYWHERE: A DATA EFFICIENT CONTRASTIVE LANGUAGE-IMAGE PRE-TRAINING PARADIGM, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption ++: Multi-CLIP for Video Caption, [Paper]
(arXiv 2021.10) Transformer-based Dual Relation Graph for Multi-label Image Recognition, [Paper]
(arXiv 2021.10) VECTOR-QUANTIZED IMAGE MODELING WITH IMPROVED VQGAN, [Paper]
(arXiv 2021.10) Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation, [Paper], [Code]
(arXiv 2021.10) NVIT: VISION TRANSFORMER COMPRESSION AND PARAMETER REDISTRIBUTION, [Paper]
(arXiv 2021.10) 6D-ViT: Category-Level 6D Object Pose Estimation via Transformer-based Instance Representation Learning, [Paper]
(arXiv 2021.10) CLIP-Adapter: Better Vision-Language Models with Feature Adapters, [Paper], [Code]
(arXiv 2021.10) ATISS: Autoregressive Transformers for Indoor Scene Synthesis, [Paper], [Code]
(arXiv 2021.10) MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE, AND MOBILE-FRIENDLY VISION TRANSFORMER, [Paper]
(arXiv 2021.10) TOKEN POOLING IN VISION TRANSFORMERS, [Paper]
(arXiv 2021.10) VIDT: AN EFFICIENT AND EFFECTIVE FULLY TRANSFORMER-BASED OBJECT DETECTOR, [Paper], [Code]
(arXiv 2021.10) CLIP4Caption: CLIP for Video Caption, [Paper]
(arXiv 2021.10) OBJECT-REGION VIDEO TRANSFORMERS, [Paper], [Code]
(arXiv 2021.10) LEVERAGING REDUNDANCY IN ATTENTION WITH REUSE TRANSFORMERS, [Paper]
(arXiv 2021.10) Dynamic Inference with Neural Interpreters, [Paper]
(arXiv 2021.10) A CLIP-Enhanced Method for Video-Language Understanding, [Paper]
(arXiv 2021.10) Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries, [Paper]
(arXiv 2021.10) Discovering Human Interactions with Large-Vocabulary Objects via Query and Multi-Scale Detection, [Paper]
(arXiv 2021.10) Learning Structural Representations for Recipe Generation and Food Retrieval, [Paper]
(arXiv 2021.10) A FREE LUNCH FROM VIT: ADAPTIVE ATTENTION MULTI-SCALE FUSION TRANSFORMER FOR FINE-GRAINED VISUAL RECOGNITION, [Paper]

2021.09

(arXiv 2021.09) Joint Multimedia Event Extraction from Video and Article, [Paper]
(arXiv 2021.09) Long-Range Transformers for Dynamic Spatiotemporal Forecasting, [Paper]
(arXiv 2021.09) Visually Grounded Concept Composition, [Paper]
(arXiv 2021.09) CoSeg: Cognitively Inspired Unsupervised Generic Event Segmentation, [Paper]
(arXiv 2021.09) CCTrans: Simplifying and Improving Crowd Counting with Transformer, [Paper]
(arXiv 2021.09) UFO-ViT: High Performance Linear Vision Transformer without Softmax, [Paper]
(arXiv 2021.09) Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds, [Paper]
(arXiv 2021.09) Localizing Objects with Self-Supervised Transformers and no Labels, [Paper], [Code]
(arXiv 2021.09) Geometry-Entangled Visual Semantic Transformer for Image Captioning, [Paper]
(arXiv 2021.09) VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, [Paper], [Code]
(arXiv 2021.09) Fine-tuning Vision Transformers for the Prediction of State Variables in Ising Models, [Paper]
(arXiv 2021.09) CLIP-It! Language-Guided Video Summarization, [Paper], [Project]
(arXiv 2021.09) MFEVIT: A ROBUST LIGHTWEIGHT TRANSFORMER-BASED NETWORK FOR MULTIMODAL 2D+3D FACIAL EXPRESSION RECOGNITION, [Paper]
(arXiv 2021.09) Sparse Spatial Transformers for Few-Shot Learning, [Paper], [Code]
(arXiv 2021.09) Vision Transformer Hashing for Image Retrieval, [Paper]
(arXiv 2021.09) PETA: Photo Albums Event Recognition using Transformers Attention, [Paper]
(arXiv 2021.09) MLIM: VISION-AND-LANGUAGE MODEL PRE-TRAINING WITH MASKED LANGUAGE AND IMAGE MODELING, [Paper]
(arXiv 2021.09) Dense Contrastive Visual-Linguistic Pretraining, [Paper]
(arXiv 2021.09) CPT: COLORFUL PROMPT TUNING FOR PRE-TRAINED VISION-LANGUAGE MODELS, [Paper]
(arXiv 2021.09) Localizing ∞-shaped fishes: Sketch-guided object localization in the wild, [Paper], [Code]
(arXiv 2021.09) CLIPORT: What and Where Pathways for Robotic Manipulation, [Paper], [Project], [Code]
(arXiv 2021.09) GraFormer: Graph Convolution Transformer for 3D Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation, [Paper]
(arXiv 2021.09) Expression Snippet Transformer for Robust Video-based Facial Expression Recognition, [Paper], [Code]
(arXiv 2021.09) LOTR: Face Landmark Localization Using Localization Transformer, [Paper]
(arXiv 2021.09) Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions, [Paper]
(arXiv 2021.09) SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction, [Paper]
(arXiv 2021.09) KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation, [Paper]
(arXiv 2021.09) T6D-Direct: Transformers for Multi-Object 6D Pose Direct Regression, [Paper]
(arXiv 2021.09) OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification, [Paper]
(arXiv 2021.09) PIX2SEQ: A LANGUAGE MODELING FRAMEWORK FOR OBJECT DETECTION, [Paper]
(arXiv 2021.09) ActionCLIP: A New Paradigm for Video Action Recognition, [Paper]
(arXiv 2021.09) BGT-Net: Bidirectional GRU Transformer Network for Scene Graph Generation, [Paper]
(arXiv 2021.09) Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering, [Paper], [Code]
(arXiv 2021.09) Anchor DETR: Query Design for Transformer-Based Detector, [Paper], [Code]
(arXiv 2021.09) An End-to-End Transformer Model for 3D Object Detection, [Paper], [Code]
(arXiv 2021.09) Hybrid Local-Global Transformer for Image Dehazing, [Paper]
(arXiv 2021.09) Semi-Supervised Wide-Angle Portraits Correction by Multi-Scale Transformer, [Paper]
(arXiv 2021.09) Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning, [Paper]
(arXiv 2021.09) Pose Transformers (POTR): Human Motion Prediction with Non-Autoregressive Transformers, [Paper], [Code]
(arXiv 2021.09) PnP-DETR: Towards Efficient Visual Analysis with Transformers, [Paper], [Code]
(arXiv 2021.09) Learning to Ground Visual Objects for Visual Dialog, [Paper]
(arXiv 2021.09) On Pursuit of Designing Multi-modal Transformer for Video Grounding, [Paper], [Code]
(arXiv 2021.09) CDTrans: Cross-domain Transformer for Unsupervised Domain Adaptation, [Paper]
(arXiv 2021.09) IS ATTENTION BETTER THAN MATRIX DECOMPOSITION? [Paper], [Code]
(arXiv 2021.09) Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering, [Paper]
(arXiv 2021.09) Line as a Visual Sentence: Context-aware Line Descriptor for Visual Localization, [Paper]
(arXiv 2021.09) Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding, [Paper]
(arXiv 2021.09) LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation, [Paper], [Code]
(arXiv 2021.09) Panoptic Narrative Grounding, [Paper]
(arXiv 2021.09) An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA, [Paper]
(arXiv 2021.09) PlaTe: Visually-Grounded Planning with Transformers in Procedural Tasks, [Paper], [Project]
(arXiv 2021.09) EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling, [Paper]
(arXiv 2021.09) Scaled ReLU Matters for Training Vision Transformers, [Paper]
(arXiv 2021.09) FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, [Paper], [Code]
(arXiv 2021.09) GCsT: Graph Convolutional Skeleton Transformer for Action Recognition, [Paper]
(arXiv 2021.09) WHYACT: Identifying Action Reasons in Lifestyle Vlogs, [Paper]
(arXiv 2021.09) Zero-Shot Open Set Detection by Extending CLIP, [Paper]
(arXiv 2021.09) Towards Transferable Adversarial Attacks on Vision Transformers, [Paper]
(arXiv 2021.09) Learning to Prompt for Vision-Language Models, [Paper], [Code]
(arXiv 2021.09) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss, [Paper], [Code]
(arXiv 2021.09) UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer, [Paper], [Code]
(arXiv 2021.09) ConvMLP: Hierarchical Convolutional MLPs for Vision, [Paper], [Code]
(arXiv 2021.09) TxT: Crossmodal End-to-End Learning with Transformers, [Paper]
(arXiv 2021.09) Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers, [Paper]
(arXiv 2021.09) Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, [Paper]
(arXiv 2021.09) SORNet: Spatial Object-Centric Representations for Sequential Manipulation, [Paper], [Project]
(arXiv 2021.09) Audio-Visual Transformer Based Crowd Counting, [Paper]
(arXiv 2021.09) Weakly Supervised Relative Spatial Reasoning for Visual Question Answering, [Paper], [Code]
(arXiv 2021.09) FUSFORMER: A TRANSFORMER-BASED FUSION APPROACH FOR HYPERSPECTRAL IMAGE SUPER-RESOLUTION, [Paper]
(arXiv 2021.09) CTRL-C: Camera calibration TRansformer with Line-Classification, [Paper], [Code]
(arXiv 2021.09) Learning to Generate Scene Graph from Natural Language Supervision, [Paper], [Code]
(arXiv 2021.09) The Animation Transformer: Visual Correspondence via Segment Matching, [Paper]
(arXiv 2021.09) Voxel Transformer for 3D Object Detection, [Paper]
(ICCV 2021.09) 3D Human Texture Estimation from a Single Image with Transformers, [Paper], [Code]
(arXiv 2021.09) Encoder-decoder with Multi-level Attention for 3D Human Shape and Pose Estimation, [Paper], [Code]
(arXiv 2021.09) Joint Graph Learning and Matching for Semantic Feature Correspondence, [Paper]
(arXiv 2021.09) Searching for Efficient Multi-Stage Vision Transformers, [Paper], [Code]

2021.08

(arXiv 2021.08) SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation, [Paper]
(arXiv 2021.08) GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer, [Paper], [Code]
(arXiv 2021.08) A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP, [Paper]
(arXiv 2021.08) Exploring and Improving Mobile Level Vision Transformers, [Paper]
(arXiv 2021.08) Cross-category Video Highlight Detection via Set-based Learning, [Paper], [Code]
(arXiv 2021.08) Shifted Chunk Transformer for Spatio-Temporal Representational Learning, [Paper]
(arXiv 2021.08) SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments, [Paper]
(arXiv 2021.08) LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision, [Paper], [Project]
(arXiv 2021.08) Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads, [Paper]
(arXiv 2021.08) SIMVLM: SIMPLE VISUAL LANGUAGE MODEL PRETRAINING WITH WEAK SUPERVISION, [Paper]
(arXiv 2021.08) TransFER: Learning Relation-aware Facial Expression Representations with Transformers, [Paper]
