
# Survey of Vision-and-Language Navigation

This is the official repository of "Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models", a comprehensive survey of recent progress in VLN with foundation models.

πŸ‘ Our Survey has been officially accepted by TMLR!!!

## Introduction

Vision-and-Language Navigation (VLN) has gained increasing attention in recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have reshaped both the challenges and the methods proposed in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and we emphasize current methods and future opportunities for leveraging foundation models to address VLN challenges. We hope our in-depth discussions provide valuable resources and insights: on the one hand, documenting the progress of the field and exploring opportunities and potential roles for foundation models in it, and on the other, organizing the various challenges and solutions in VLN for foundation model researchers.

## Citation

If you find our work useful in your research, please consider citing:

    @article{zhang2024vision,
      title={Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models},
      author={Zhang, Yue and Ma, Ziqiao and Li, Jialu and Qiao, Yanyuan and Wang, Zun and Chai, Joyce and Wu, Qi and Bansal, Mohit and Kordjamshidi, Parisa},
      journal={arXiv preprint arXiv:2407.07035},
      year={2024}
    }

🔔 We will update this page frequently. If you believe additional work should be included, please email us ([email protected]) or raise an issue. Your suggestions and comments are invaluable in ensuring the completeness of our resources.

## Content

- [Relevant Surveys](#relevant-surveys)
- [World Model](#world-model)
- [Human Model: Interpreting and Communicating with Humans](#human-model-interpreting-and-communicating-with-humans)
- [VLN Agent: Learning an Embodied Agent for Reasoning and Planning](#vln-agent-learning-an-embodied-agent-for-reasoning-and-planning)
- [VLN-CE Agent](#vln-ce-agent)
- [LLM/VLM-based VLN Agent](#llmvlm-based-vln-agent)
- [Behavior Analysis of the VLN Agent](#behavior-analysis-of-the-vln-agent)

## Relevant Surveys

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions | ACL | 2022 | Github |
| Visual Language Navigation: A Survey and Open Challenges | - | 2023 | - |
| Vision-Language Navigation: A Survey and Taxonomy | - | 2021 | - |

## World Model

A world model helps the VLN agent understand its surrounding environment, predict how its actions would change the world state, and align its perception and actions with the language instructions.
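
To make these three roles concrete, here is a minimal, illustrative Python sketch of a world-model interface. All names (`WorldState`, `WorldModel`, `perceive`, `predict`, `align`) are our own assumptions for exposition and are not taken from any paper in the table below.

```python
# Illustrative sketch only: a minimal world-model interface capturing the three
# roles described above. Every name here is hypothetical, not from a cited paper.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorldState:
    """The agent's internal estimate of the environment."""
    view_features: List[float]        # encoded panoramic observation
    pose: Tuple[float, float, float]  # (x, y, heading)


class WorldModel:
    def perceive(self, observation: List[float],
                 pose: Tuple[float, float, float]) -> WorldState:
        """Understand the surroundings: encode raw observations into a state."""
        return WorldState(view_features=observation, pose=pose)

    def predict(self, state: WorldState, action: str) -> WorldState:
        """Predict how an action would change the world state, e.g. by
        synthesizing the future view (in the spirit of Pathdreamer)."""
        raise NotImplementedError

    def align(self, state: WorldState, instruction: str) -> float:
        """Score how well a (predicted) state matches the instruction, so the
        agent can ground its perception and actions in language."""
        raise NotImplementedError
```

An agent built on such a model could, for example, call `predict` for each candidate action and choose the one whose predicted state receives the highest `align` score against the instruction.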

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation | AAAI | 2024 | - |
| Volumetric Environment Representation for Vision-Language Navigation | CVPR | 2024 | Github |
| Vision Language Navigation with Knowledge-driven Environmental Dreamer | IJCAI | 2023 | - |
| Frequency-Enhanced Data Augmentation for Vision-and-Language Navigation | NeurIPS | 2023 | Github |
| PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation | NeurIPS | 2023 | Github |
| Simple and Effective Synthesis of Indoor 3D Scenes | AAAI | 2023 | Github |
| Learning Navigational Visual Representations with Semantic Map Supervision | ICCV | 2023 | - |
| Learning Vision-and-Language Navigation from YouTube Videos | ICCV | 2023 | Github |
| GridMM: Grid Memory Map for Vision-and-Language Navigation | ICCV | 2023 | Github |
| BEVBert: Multimodal Map Pre-training for Language-guided Navigation | ICCV | 2023 | Github |
| Scaling Data Generation in Vision-and-Language Navigation | ICCV | 2023 | Github |
| A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning | CVPR | 2023 | Github |
| EnvEdit: Environment Editing for Vision-and-Language Navigation | CVPR | 2022 | Github |
| Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation | ECCV | 2022 | Github |
| How Much Can CLIP Benefit Vision-and-Language Tasks? | ICLR | 2022 | Github |
| Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation | CVPR | 2022 | Github |
| History Aware Multimodal Transformer for Vision-and-Language Navigation | NeurIPS | 2021 | Github |
| Pathdreamer: A World Model for Indoor Navigation | ICCV | 2021 | - |
| Episodic Transformer for Vision-and-Language Navigation | ICCV | 2021 | - |
| Airbert: In-domain Pretraining for Vision-and-Language Navigation | ICCV | 2021 | Github |
| Vision-Language Navigation with Random Environmental Mixup | ICCV | 2021 | Github |

## Human Model: Interpreting and Communicating with Humans

The human model interprets human-provided natural language instructions in each situation so the agent can complete navigation tasks; in the reverse direction, it generates instructions to communicate with humans.
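
As a rough sketch of these two directions, instruction interpretation (following) and instruction generation (the "speaker" models in the table below), consider the following illustrative Python interface. The names and the toy logic are hypothetical assumptions, not code from any listed paper.

```python
# Illustrative sketch only: the two complementary directions of a human model.
# The interface and the toy string matching are hypothetical, for exposition.
from typing import List


class HumanModel:
    def interpret(self, instruction: str, visible_objects: List[str]) -> List[str]:
        """Follower direction: ground landmarks mentioned in the instruction
        against what the agent currently observes (toy word matching here;
        real systems use learned cross-modal grounding)."""
        words = set(instruction.lower().split())
        return [obj for obj in visible_objects if obj.lower() in words]

    def generate(self, trajectory: List[str]) -> str:
        """Speaker direction: describe a trajectory as a natural-language
        instruction, e.g. for data augmentation or human communication."""
        return "Go past the " + ", then the ".join(trajectory) + "."
```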

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Navigation Instruction Generation with BEV Perception and Large Language Models | ECCV | 2024 | Github |
| Controllable Navigation Instruction Generation with Chain of Thought Prompting | ECCV | 2024 | Github |
| Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation | ACL | 2024 | Github |
| LLM as Copilot for Coarse-grained Vision-and-Language Navigation | ECCV | 2024 | - |
| Correctable Landmark Discovery via Large Models for Vision-Language Navigation | TPAMI | 2024 | Github |
| NavHint: Vision and Language Navigation Agent with a Hint Generator | EACL | 2024 | Github |
| Learning to Follow and Generate Instructions for Language-Capable Navigation | TPAMI | 2023 | - |
| A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning | CVPR | 2023 | Dataset |
| Learning Vision-and-Language Navigation from YouTube Videos | ICCV | 2023 | Github |
| Lana: A Language-Capable Navigator for Instruction Following and Generation | CVPR | 2023 | Github |
| KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation | CVPR | 2023 | Github |
| PASTS: Progress-Aware Spatio-Temporal Transformer Speaker for Vision-and-Language Navigation | MM | 2023 | - |
| CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation | - | 2023 | - |
| VLN-Trans: Translator for the Vision and Language Navigation Agent | ACL | 2023 | Github |
| Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration | ACL | 2022 | Github |
| Less is More: Generating Grounded Navigation Instructions from Landmarks | CVPR | 2022 | Github |
| On the Evaluation of Vision-and-Language Navigation Instructions | EACL | 2021 | - |
| Do As I Can, Not As I Say: Grounding Language in Robotic Affordances | CoRL | 2022 | Github |

## VLN Agent: Learning an Embodied Agent for Reasoning and Planning

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation | AAAI | 2023 | - |
| Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation | ICCV | 2023 | Github |
| Adaptive Zone-aware Hierarchical Planner for Vision-Language Navigation | ICCV | 2023 | Github |
| Bird's-Eye-View Scene Graph for Vision-Language Navigation | ICCV | 2023 | - |
| Masked Path Modeling for Vision-and-Language Navigation | EMNLP Findings | 2023 | - |
| Improving Vision-and-Language Navigation by Generating Future-View Image Semantics | CVPR | 2023 | Github |
| HOP+: History-Enhanced and Order-Aware Pre-Training for Vision-and-Language Navigation | TPAMI | 2023 | - |
| Target-Driven Structured Transformer Planner for Vision-Language Navigation | MM | 2022 | Github |
| HOP: History-and-Order Aware Pre-training for Vision-and-Language Navigation | CVPR | 2022 | Github |
| LOViS: Learning Orientation and Visual Signals for Vision and Language Navigation | COLING | 2022 | Github |
| Scene-Intuitive Agent for Remote Embodied Visual Grounding | CVPR | 2021 | - |
| SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation | NeurIPS | 2021 | - |
| The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation | ICCV | 2021 | Github |
| VLN BERT: A Recurrent Vision-and-Language BERT for Navigation | CVPR | 2021 | Github |
| Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training | CVPR | 2020 | Github |

## VLN-CE Agent

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation | CVPR | 2024 | Github |
| ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments | TPAMI | 2024 | Github |
| Narrowing the Gap between Vision and Action in Navigation | MM | 2024 | - |
| BEVBert: Multimodal Map Pre-training for Language-guided Navigation | ICCV | 2023 | Github |
| Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation | CVPR | 2022 | Github |
| Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments | ECCV | 2020 | Github |

## LLM/VLM-based VLN Agent

### Zero-shot

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Discuss Before Moving: Visual Language Navigation via Multi-expert Discussions | ICRA | 2024 | Github |
| MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation | ACL | 2024 | Github |
| MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains | - | 2024 | - |
| InstructNav: Zero-shot System for Generic Instruction Navigation in Unexplored Environment | - | 2024 | Github |
| NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models | AAAI | 2024 | - |
| March in Chat: Interactive Prompting for Remote Embodied Referring Expression | ICCV | 2023 | Github |
| Vision and Language Navigation in the Real World via Online Visual Language Mapping | - | 2023 | - |
| A2Nav: Action-Aware Zero-Shot Robot Navigation by Exploiting Vision-and-Language Ability of Foundation Models | NeurIPS Workshop | 2023 | - |
| CLIP-Nav: Using CLIP for Zero-Shot Vision-and-Language Navigation | - | 2022 | - |

### Fine-tuning

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| LangNav: Language as a Perceptual Representation for Navigation | NAACL Findings | 2024 | Github |
| NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning | - | 2024 | Github |
| Towards Learning a Generalist Model for Embodied Navigation | CVPR | 2024 | Github |
| NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models | ECCV | 2024 | Github |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | RSS | 2024 | Github |

## Behavior Analysis of the VLN Agent

| Title | Venue | Date | Code |
| --- | --- | --- | --- |
| Diagnosing Vision-and-Language Navigation: What Really Matters | NAACL | 2022 | Github |
| Behavioral Analysis of Vision-and-Language Navigation Agents | CVPR | 2023 | Github |
| Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation | EMNLP Findings | 2024 | Github |
