
# Survey of Vision-and-Language Navigation

This is the official repository of "Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models", a comprehensive survey of recent progress in VLN with foundation models.

πŸ‘ Our Survey has been officially accepted by TMLR!!!

## Introduction

Vision-and-Language Navigation (VLN) has gained increasing attention in recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have reshaped both the challenges and the methods proposed in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and we emphasize current methods and future opportunities for leveraging foundation models to address VLN challenges. We hope our in-depth discussions provide valuable resources and insights: on the one hand, documenting the progress of the field and exploring opportunities and potential roles for foundation models in it, and on the other, organizing the various challenges and solutions in VLN for foundation model researchers.

## Citation

If you find our work useful in your research, please consider citing:

    @article{zhang2024vision,
      title={Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models},
      author={Zhang, Yue and Ma, Ziqiao and Li, Jialu and Qiao, Yanyuan and Wang, Zun and Chai, Joyce and Wu, Qi and Bansal, Mohit and Kordjamshidi, Parisa},
      journal={arXiv preprint arXiv:2407.07035},
      year={2024}
    }

🔔 We will update this page frequently. If you believe additional work should be included, please email us ([email protected]) or raise an issue. Your suggestions and comments are invaluable in ensuring the completeness of our resources.

## Content

- [Relevant Surveys](#relevant-surveys)
- [World Model](#world-model)
- [Human Model: Interpreting and Communicating with Humans](#human-model-interpreting-and-communicating-with-humans)
- [VLN Agent: Learning an Embodied Agent for Reasoning and Planning](#vln-agent-learning-an-embodied-agent-for-reasoning-and-planning)
- [VLN-CE Agent](#vln-ce-agent)
- [LLM/VLM-based VLN Agent](#llmvlm-based-vln-agent)
- [Behavior Analysis of the VLN Agent](#behavior-analysis-of-the-vln-agent)

## Relevant Surveys

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions | ACL | 2022 | Github |
| Visual Language Navigation: A Survey and Open Challenges | - | 2023 | - |
| Vision-Language Navigation: A Survey and Taxonomy | - | 2021 | - |

## World Model

A world model helps the VLN agent understand its surrounding environment, predict how its actions would change the world state, and align its perception and actions with the language instructions.
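
To make these three roles concrete, here is a minimal, illustrative Python sketch of a world-model interface. All names (`WorldState`, `WorldModel`, `perceive`, `predict`, `align`) are our own assumptions for exposition and are not taken from any paper in the table below.

```python
# Illustrative sketch only: a minimal world-model interface capturing the three
# roles described above. Every name here is hypothetical, not from a cited paper.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorldState:
    """The agent's internal estimate of the environment."""
    view_features: List[float]        # encoded panoramic observation
    pose: Tuple[float, float, float]  # (x, y, heading)


class WorldModel:
    def perceive(self, observation: List[float],
                 pose: Tuple[float, float, float]) -> WorldState:
        """Understand the surroundings: encode raw observations into a state."""
        return WorldState(view_features=observation, pose=pose)

    def predict(self, state: WorldState, action: str) -> WorldState:
        """Predict how an action would change the world state, e.g. by
        synthesizing the future view (in the spirit of Pathdreamer)."""
        raise NotImplementedError

    def align(self, state: WorldState, instruction: str) -> float:
        """Score how well a (predicted) state matches the instruction, so the
        agent can ground its perception and actions in language."""
        raise NotImplementedError
```

An agent built on such a model could, for example, call `predict` for each candidate action and choose the one whose predicted state receives the highest `align` score against the instruction.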

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation | AAAI | 2024 | - |
| Volumetric Environment Representation for Vision-Language Navigation | CVPR | 2024 | Github |
| Vision Language Navigation with Knowledge-driven Environmental Dreamer | IJCAI | 2023 | - |
| Frequency-Enhanced Data Augmentation for Vision-and-Language Navigation | NeurIPS | 2023 | Github |
| PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation | NeurIPS | 2023 | Github |
| Simple and Effective Synthesis of Indoor 3D Scenes | AAAI | 2023 | Github |
| Learning Navigational Visual Representations with Semantic Map Supervision | ICCV | 2023 | - |
| Learning Vision-and-Language Navigation from YouTube Videos | ICCV | 2023 | Github |
| GridMM: Grid Memory Map for Vision-and-Language Navigation | ICCV | 2023 | Github |
| BEVBert: Multimodal Map Pre-training for Language-guided Navigation | ICCV | 2023 | Github |
| Scaling Data Generation in Vision-and-Language Navigation | ICCV | 2023 | Github |
| A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning | CVPR | 2023 | Github |
| EnvEdit: Environment Editing for Vision-and-Language Navigation | CVPR | 2022 | Github |
| Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation | ECCV | 2022 | Github |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | ICLR | 2022 | Github |
| Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation | CVPR | 2022 | Github |
| History Aware Multimodal Transformer for Vision-and-Language Navigation | NeurIPS | 2021 | Github |
| Pathdreamer: A World Model for Indoor Navigation | ICCV | 2021 | - |
| Episodic Transformer for Vision-and-Language Navigation | ICCV | 2021 | - |
| Airbert: In-domain Pretraining for Vision-and-Language Navigation | ICCV | 2021 | Github |
| Vision-Language Navigation with Random Environmental Mixup | ICCV | 2021 | Github |

## Human Model: Interpreting and Communicating with Humans

The human model interprets human-provided natural language instructions in each situation so the agent can complete navigation tasks; in the reverse direction, it generates instructions to communicate with humans.
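
As a rough sketch of these two directions, instruction interpretation (following) and instruction generation (the "speaker" models in the table below), consider the following illustrative Python interface. The names and the toy logic are hypothetical assumptions, not code from any listed paper.

```python
# Illustrative sketch only: the two complementary directions of a human model.
# The interface and the toy string matching are hypothetical, for exposition.
from typing import List


class HumanModel:
    def interpret(self, instruction: str, visible_objects: List[str]) -> List[str]:
        """Follower direction: ground landmarks mentioned in the instruction
        against what the agent currently observes (toy word matching here;
        real systems use learned cross-modal grounding)."""
        words = set(instruction.lower().split())
        return [obj for obj in visible_objects if obj.lower() in words]

    def generate(self, trajectory: List[str]) -> str:
        """Speaker direction: describe a trajectory as a natural-language
        instruction, e.g. for data augmentation or human communication."""
        return "Go past the " + ", then the ".join(trajectory) + "."
```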

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Navigation Instruction Generation with BEV Perception and Large Language Models | ECCV | 2024 | Github |
| Controllable Navigation Instruction Generation with Chain of Thought Prompting | ECCV | 2024 | Github |
| Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation | ACL | 2024 | Github |
| LLM as Copilot for Coarse-grained Vision-and-Language Navigation | ECCV | 2024 | - |
| Correctable Landmark Discovery via Large Models for Vision-Language Navigation | TPAMI | 2024 | Github |
| NavHint: Vision and Language Navigation Agent with a Hint Generator | EACL | 2024 | Github |
| Learning to Follow and Generate Instructions for Language-Capable Navigation | TPAMI | 2023 | - |
| A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning | CVPR | 2023 | Dataset |
| Learning Vision-and-Language Navigation from YouTube Videos | ICCV | 2023 | Github |
| Lana: A Language-Capable Navigator for Instruction Following and Generation | CVPR | 2023 | Github |
| KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation | CVPR | 2023 | Github |
| PASTS: Progress-Aware Spatio-Temporal Transformer Speaker for Vision-and-Language Navigation | MM | 2023 | - |
| CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation | - | 2023 | - |
| VLN-Trans: Translator for the Vision and Language Navigation Agent | ACL | 2023 | Github |
| Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration | ACL | 2022 | Github |
| Less is More: Generating Grounded Navigation Instructions from Landmarks | CVPR | 2022 | Github |
| On the Evaluation of Vision-and-Language Navigation Instructions | EACL | 2021 | - |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | CoRL | 2022 | Github |

## VLN Agent: Learning an Embodied Agent for Reasoning and Planning

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation | AAAI | 2023 | - |
| Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation | ICCV | 2023 | Github |
| Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation | ICCV | 2023 | Github |
| Bird's-Eye-View Scene Graph for Vision-Language Navigation | ICCV | 2023 | - |
| Masked Path Modeling for Vision-and-Language Navigation | EMNLP Findings | 2023 | - |
| Improving Vision-and-Language Navigation by Generating Future-View Image Semantics | CVPR | 2023 | Github |
| HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation | TPAMI | 2023 | - |
| Target-Driven Structured Transformer Planner for Vision-Language Navigation | MM | 2022 | Github |
| HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation | CVPR | 2022 | Github |
| LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation | COLING | 2022 | Github |
| Scene-Intuitive Agent for Remote Embodied Visual Grounding | CVPR | 2021 | - |
| SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation | NeurIPS | 2021 | - |
| The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation | ICCV | 2021 | Github |
| VLN BERT: A Recurrent Vision-and-Language BERT for Navigation | CVPR | 2021 | Github |
| Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training | CVPR | 2020 | Github |

## VLN-CE Agent

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation | CVPR | 2024 | Github |
| ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments | TPAMI | 2024 | Github |
| Narrowing the Gap between Vision and Action in Navigation | MM | 2024 | - |
| BEVBert: Multimodal Map Pre-training for Language-guided Navigation | ICCV | 2023 | Github |
| Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation | CVPR | 2022 | Github |
| Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments | ECCV | 2020 | Github |

## LLM/VLM-based VLN Agent

### Zero-shot

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions | ICRA | 2024 | Github |
| MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation | ACL | 2024 | Github |
| MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains | - | 2024 | - |
| InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment | - | 2024 | Github |
| NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models | AAAI | 2024 | - |
| March in Chat: Interactive Prompting for Remote Embodied Referring Expression | ICCV | 2023 | Github |
| Vision and Language Navigation in the Real World via Online Visual Language Mapping | - | 2023 | - |
| A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models | NeurIPS Workshop | 2023 | - |
| CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation | - | 2022 | - |

### Fine-tuning

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| LangNav: Language as a Perceptual Representation for Navigation | NAACL Findings | 2024 | Github |
| NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning | - | 2024 | Github |
| Towards Learning a Generalist Model for Embodied Navigation | CVPR | 2024 | Github |
| NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models | ECCV | 2024 | Github |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | RSS | 2024 | Github |

## Behavior Analysis of the VLN Agent

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Diagnosing Vision-and-Language Navigation: What Really Matters | NAACL | 2022 | Github |
| Behavioral Analysis of Vision-and-Language Navigation Agents | CVPR | 2023 | Github |
| Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation | EMNLP Findings | 2024 | Github |
