Awesome Vision-and-Language Navigation

This repo keeps track of recent advances in Vision-and-Language Navigation (VLN) research. Please check out our ACL 2022 VLN survey paper for the categorization approach and detailed discussions of tasks, methods, and future directions: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions.

A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from natural language processing, computer vision, robotics, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community.

Datasets and Benchmarks

Initial Instruction

  • [R2R]: Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
    CVPR 2018 paper

  • [CHAI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
    EMNLP 2018 paper

  • [LANI]: Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
    EMNLP 2018 paper

  • Following High-level Navigation Instructions on a Simulated Quadcopter with Imitation Learning
    RSS 2018 paper

  • [RoomNav]: Building Generalizable Agents with a Realistic and Rich 3D Environment
    arXiv 2018 paper

  • [EmbodiedQA]: Embodied Question Answering
    CVPR 2018 paper

  • [IQA]: Iqa: Visual Question Answering in Interactive Environments
    CVPR 2018 paper

  • [Room-for-Room]: Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
    ACL 2019 paper

  • [XL-R2R]: Cross-Lingual Vision-Language Navigation
    arXiv 2019 paper

  • [Touchdown]: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
    CVPR 2019 paper

  • The Streetlearn Environment and Dataset
    arXiv 2019 paper

  • Learning To Follow Directions in Street View
    arXiv 2019 paper

  • [Room-Across-Room]: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
    EMNLP 2020 paper

  • [VLNCE]: Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
    ECCV 2020 paper

  • [Retouchdown]: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View
    Spatial Language Understanding Workshop 2020 paper

  • [REVERIE]: Remote Embodied Visual Referring Expression in Real Indoor Environments
    CVPR 2020 paper

  • [ALFRED]: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
    CVPR 2020 paper

  • [Landmark-RxR]: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
    NeurIPS 2021 paper

  • Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
    ICRA 2021 [Project Page] [arXiv] [GitHub]

  • [Talk2Nav]: Long-Range Vision-and-Language Navigation with Dual Attention and Spatial Memory
    IJCV 2021 paper

  • [Habitat-Matterport]: 1000 Large-scale 3D Environments for Embodied AI
    NeurIPS 2021 paper

  • [SOON]: Scenario Oriented Object Navigation with Graph-based Exploration
    CVPR 2021 paper

  • [ZInD]: Zillow Indoor Dataset: Annotated Floor Plans With 360° Panoramas and 3D Room Layouts
    CVPR 2021 paper

Guidance

  • [VNLA]: Vision-based Navigation with Language-based Assistance via Imitation Learning with Indirect Intervention
    CVPR 2019 paper

  • [HANNA]: Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
    EMNLP 2019 paper

  • [CEREALBAR]: Executing Instructions in Situated Collaborative Interactions
    ACL 2019 paper

  • [Just Ask]: An Interactive Learning Framework for Vision and Language Navigation
    AAAI 2020 paper

Dialog

  • [Talk the Walk]: Navigating New York City through Grounded Dialogue
    arXiv 2018 paper

  • [CVDN]: Vision-and-Dialog Navigation
    CoRL 2019 paper

  • Collaborative Dialogue in Minecraft
    ACL 2019 paper

  • [RobotSlang]: The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation
    CoRL 2020 paper

  • [TEACh]: Task-driven Embodied Agents that Chat
    AAAI 2022 paper

  • [DialFRED]: Dialogue-enabled agents for embodied instruction following
    RA-L 2022 paper

  • [Don't Copy the Teacher]: Data and Model Challenges in Embodied Dialogue
    EMNLP 2022 paper

  • [AVDN]: Aerial Vision-and-Dialog Navigation
    ACL 2023 paper

Evaluation

Here we list papers that introduce new evaluation metrics.

  • On Evaluation of Embodied Navigation Agents
    arXiv 2018 paper

  • Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments
    CVPR 2019 paper

  • Vision-and-Dialog Navigation
    CoRL 2019 paper

  • Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
    ACL 2019 paper

  • General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
    arXiv 2019 paper

Methods

Representation Learning

Pretraining

  • Robust Navigation with Language Pretraining and Stochastic Sampling
    EMNLP 2019 paper

  • Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
    ECCV 2020 paper

  • Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
    ECCV 2020 paper

  • Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training
    CVPR 2020 paper

  • Episodic Transformer for Vision-and-Language Navigation
    ICCV 2021 paper

  • The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation
    ICCV 2021 paper

  • A Recurrent Vision-and-Language BERT for Navigation
    CVPR 2021 paper

  • SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
    CVPR 2021 paper

  • Airbert: In-domain Pretraining for Vision-and-Language Navigation
    ICCV 2021 paper

  • NDH-Full: Learning and Evaluating Navigational Agents on Full-Length Dialogue
    EMNLP 2021 paper

Semantic Understanding

  • Shifting the Baseline: Single Modality Performance on Visual Navigation & QA
    ACL 2019 paper

  • Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
    ACL 2019 paper

  • Embodied Vision-and-Language Navigation with Dynamic Convolutional Filters
    BMVC 2019 paper

  • Diagnosing the Environment Bias in Vision-and-Language Navigation
    IJCAI 2020 paper

  • Object-and-Action Aware Model for Visual Language Navigation
    ECCV 2020 paper

  • Diagnosing Vision-and-Language Navigation: What Really Matters
    arXiv 2021 paper

  • Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression
    CVPR 2021 paper

  • Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
    IEEE CAS 2021 paper

  • SASRA: Semantically-aware Spatio-temporal Reasoning Agent for Vision-and-Language Navigation in Continuous Environments
    ICPR 2022 [Paper] [Website] [Video]

  • FILM: Following Instructions in Language with Modular Methods
    ICLR 2022 [Paper] [Website] [Video] [Code]

  • Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue
    EMNLP 2022 [Paper] [Video]

Graph Representation

  • Chasing Ghosts: Instruction Following as Bayesian State Tracking
    NeurIPS 2019 paper

  • Language and Visual Entity Relationship Graph for Agent Navigation
    NeurIPS 2020 paper

  • Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
    NeurIPS 2020 paper

  • Topological Planning with Transformers for Vision-and-Language Navigation
    CVPR 2021 paper

Memory-augmented Model

  • Help, Anna! Visual Navigation with Natural Multimodal Assistance via Retrospective Curiosity-Encouraging Imitation Learning
    EMNLP 2019 paper

  • Vision-Dialog Navigation by Exploring Cross-modal Memory
    CVPR 2020 paper

  • A Recurrent Vision-and-Language BERT for Navigation
    CVPR 2021 paper

  • Scene-Intuitive Agent for Remote Embodied Visual Grounding
    CVPR 2021 paper

  • History Aware Multimodal Transformer for Vision-and-Language Navigation
    NeurIPS 2021 paper

Auxiliary Tasks

  • Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
    ICLR 2019 paper

  • Transferable Representation Learning in Vision-and-Language Navigation
    ICCV 2019 paper

  • Vision-Language Navigation with Self-Supervised Auxiliary Reasoning Tasks
    CVPR 2020 paper

Action Strategy Learning

Reinforcement Learning

  • Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
    ECCV 2018 paper

  • Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
    CVPR 2019 paper

  • Vision-language navigation policy learning and adaptation
    TPAMI 2020 paper

  • Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation
    ACL 2019 paper

  • General Evaluation for Instruction Conditioned Navigation using Dynamic Time Warping
    arXiv 2019 paper

  • Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation
    arXiv 2019 paper

  • From language to goals: Inverse reinforcement learning for vision-based instruction following.
    arXiv 2019 paper

  • Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision
    NeurIPS 2021 paper

  • Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning
    IEEE CAS 2021 paper

Exploration during Navigation

  • Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
    CVPR 2019 paper

  • Active Visual Information Gathering for Vision-Language Navigation
    ECCV 2020 paper

  • Pathdreamer: A World Model for Indoor Navigation
    ICCV 2021 paper

Navigation Planning

  • Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation
    ECCV 2018 paper

  • Chasing Ghosts: Instruction Following as Bayesian State Tracking
    NeurIPS 2019 paper

  • Generative Language-Grounded Policy in Vision-and-Language Navigation with Bayes' Rule
    ICLR 2020 paper

  • Learning to Stop: A Simple yet Effective Approach to Urban Vision-Language Navigation
    EMNLP Findings 2020 paper

  • Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
    ICRA 2021 [Project Page] [arXiv] [GitHub]

  • Waypoint Models for Instruction-guided Navigation in Continuous Environments
    ICCV 2021 paper

  • Pathdreamer: A World Model for Indoor Navigation
    ICCV 2021 paper

  • Neighbor-view Enhanced Model for Vision and Language Navigation
    arXiv 2021 paper

  • Language-Aligned Waypoint (LAW) Supervision for Vision-and-Language Navigation in Continuous Environments
    EMNLP 2021 paper

  • One Step at a Time: Long-Horizon Vision-and-Language Navigation with Milestones
    arXiv 2022 paper

Asking for Help

  • CVDN: Vision-and-Dialog Navigation
    CoRL 2019 paper

  • Learning when and what to ask: a hierarchical reinforcement learning framework
    EMNLP 2019 paper

  • Just Ask: An Interactive Learning Framework for Vision and Language Navigation
    AAAI 2020 paper

  • RMM: A Recursive Mental Model for Dialog Navigation
    EMNLP Findings 2020 paper

  • Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation
    ICCV 2021 paper

  • TEACh: Task-driven Embodied Agents that Chat
    arXiv 2021 paper

  • A Framework for Learning to Request Rich and Contextually Useful Information from Humans
    arXiv 2021 paper

Data-centric Learning

Data Augmentation

  • Speaker-Follower Models for Vision-and-Language Navigation
    NeurIPS 2018 paper

  • Multi-modal Discriminative Model for Vision-and-Language Navigation
    SpLU&RoboNLP Workshop 2019 paper

  • Transferable Representation Learning in Vision-and-Language Navigation
    ICCV 2019 paper

  • Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
    NAACL 2019 paper

  • Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
    ECCV 2020 paper

  • Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
    NeurIPS 2020 paper

  • Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
    EACL 2021 paper

  • Vision-Language Navigation with Random Environmental Mixup
    ICCV 2021 paper

  • On the Evaluation of Vision-and-Language Navigation Instructions
    EACL 2021 paper

  • EnvEdit: Environment Editing for Vision-and-Language Navigation
    CVPR 2022 paper

  • AIGeN: An Adversarial Approach for Instruction Generation in VLN
    CVPRW 2024 paper

Curriculum Learning

  • BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
    ACL 2020 paper

  • Curriculum Learning for Vision-and-Language Navigation
    NeurIPS 2021 paper

Multitask Learning

  • Environment-agnostic Multitask Learning for Natural Language Grounded Navigation
    ECCV 2020 paper

  • Embodied Multimodal Multitask Learning
    IJCAI 2020 paper

Instruction Interpretation

  • Multi-View Learning for Vision-and-Language Navigation
    arXiv 2020 paper

  • Sub-Instruction Aware Vision-and-Language Navigation
    EMNLP 2020 paper

  • Look Wide and Interpret Twice: Improving Performance on Interactive Instruction-following Tasks
    arXiv 2021 paper

Prior Exploration

  • Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
    CVPR 2019 paper

  • Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
    NAACL 2019 paper

  • Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation
    ACL 2019 paper

  • Counterfactual Vision-and-Language Navigation: Unravelling the Unseen
    NeurIPS 2020 paper

  • Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling
    ECCV 2020 paper

  • Topological Planning with Transformers for Vision-and-Language Navigation
    CVPR 2021 paper

  • Rethinking the Spatial Route Prior in Vision-and-Language Navigation
    arXiv 2021 paper

Related Areas

Using 2D MAPS environments

  • Learning to follow navigational directions
    ACL 2010 paper

  • Learning to interpret natural language navigation instructions from observations
    AAAI 2011 paper

  • Run through the streets: A new dataset and baseline models for realistic urban navigation
    EMNLP 2019 paper

Using synthetic environments

  • Walk the talk: Connecting language, knowledge, and action in route instructions
    AAAI 2006 paper

  • Learning to Interpret Natural Language Navigation Instructions from Observations
    AAAI 2011 paper

  • Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight
    PMLR 2020 paper

Visual Navigation

  • Target-driven visual navigation in indoor scenes using deep reinforcement learning
    ICRA 2017 paper

  • Learning to navigate
    MULEA 2019 paper

  • Learning to navigate in cities without a map
    NeurIPS 2018 paper

  • Deep Learning for Embodied Vision Navigation: A Survey
    arXiv 2021 paper

  • Self-Supervised Object Goal Navigation with In-Situ Finetuning
    IROS 2023 paper video

If you find this repo useful for your research, please cite:

@InProceedings{jing2022vln,
      title={Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions}, 
      author={Jing Gu and Eliana Stefani and Qi Wu and Jesse Thomason and Xin Eric Wang},
      booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)},
      year = {2022}
}