This is the official repository of "Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models", a comprehensive survey of recent progress in VLN with foundation models.
Vision-and-Language Navigation (VLN) has gained increasing attention in recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped the challenges and methods proposed in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning and emphasizes current methods and future opportunities for leveraging foundation models to address VLN challenges. We hope our in-depth discussions provide valuable resources and insights: on the one hand, to document the progress and explore opportunities and potential roles for foundation models in this field, and on the other, to organize the different challenges and solutions in VLN for foundation model researchers.
If you find our work useful in your research, please consider citing:
@article{zhang2024vision,
title={Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models},
author={Zhang, Yue and Ma, Ziqiao and Li, Jialu and Qiao, Yanyuan and Wang, Zun and Chai, Joyce and Wu, Qi and Bansal, Mohit and Kordjamshidi, Parisa},
journal={arXiv preprint arXiv:2407.07035},
year={2024}
}
We will update this page frequently. If you believe additional work should be included, please do not hesitate to email us ([email protected]) or raise an issue. Your suggestions and comments are invaluable to ensuring the completeness of our resources.
Title | Venue | Date | Code |
---|---|---|---|
Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions | ACL | 2022 | Github |
Visual language navigation: A survey and open challenges | - | 2023 | - |
Vision-Language Navigation: A Survey and Taxonomy | - | 2021 | - |
A world model helps the VLN agent understand its surrounding environment, predict how its actions would change the world state, and align its perception and actions with language instructions.
The human model interprets human-provided natural language instructions in the given situation to complete navigation tasks.
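To make this framing concrete, below is a minimal, illustrative Python sketch of how a VLN agent might be organized around a world model and a human model. All class and method names here (`WorldModel`, `HumanModel`, `VLNAgent`, `predict`, `interpret`, `step`) are hypothetical and chosen purely for exposition; they are not from the survey or any specific VLN codebase.

```python
# Illustrative sketch only: hypothetical interfaces for the world-model /
# human-model framing described above, not an actual VLN implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WorldState:
    observation: str            # placeholder for a visual observation encoding
    position: Tuple[float, float]  # placeholder for the agent's pose


class WorldModel:
    """Grounds perception and predicts how an action changes the world state."""

    def predict(self, state: WorldState, action: str) -> WorldState:
        # A real agent would roll out a learned visual/dynamics model here.
        raise NotImplementedError


class HumanModel:
    """Interprets natural-language instructions in the current situation."""

    def interpret(self, instruction: str, state: WorldState) -> List[str]:
        # Returns candidate actions or sub-goals implied by the instruction.
        raise NotImplementedError


class VLNAgent:
    def __init__(self, world_model: WorldModel, human_model: HumanModel):
        self.world_model = world_model
        self.human_model = human_model

    def step(self, instruction: str, state: WorldState) -> str:
        # Align instruction understanding with predicted action outcomes;
        # candidate scoring is omitted in this sketch.
        candidates = self.human_model.interpret(instruction, state)
        return candidates[0] if candidates else "stop"
```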
Title | Venue | Date | Code |
---|---|---|---|
Diagnosing Vision-and-Language Navigation: What Really Matters | NAACL | 2022 | Github |
Behavioral Analysis of Vision-and-Language Navigation Agents | CVPR | 2023 | Github |
Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation | EMNLP Findings | 2024 | Github |