Francesco Taioli, Stefano Rosa, Alberto Castellini, Lorenzo Natale, Alessio Del Bue, Alessandro Farinelli, Marco Cristani, Yiming Wang
Accepted to IROS 2024
contact: [email protected]
Important
Consider citing our paper:
@INPROCEEDINGS{taioli2024mind,
title={{Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation}},
author={Taioli, Francesco and Rosa, Stefano and Castellini, Alberto and Natale, Lorenzo and Del Bue, Alessio and Farinelli, Alessandro and Cristani, Marco and Wang, Yiming},
year={2024},
volume={},
number={},
pages={12993-13000},
booktitle={2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
doi={10.1109/IROS58592.2024.10801822}
}
Vision-and-Language Navigation in Continuous Environments (VLN-CE) is one of the most intuitive yet challenging embodied AI tasks. Agents are tasked to navigate towards a target goal by executing a set of low-level actions, following a series of natural language instructions. All VLN-CE methods in the literature assume that language instructions are exact. However, in practice, instructions given by humans can contain errors when describing a spatial environment due to inaccurate memory or confusion. Current VLN-CE benchmarks do not address this scenario, making the state-of-the-art methods in VLN-CE fragile in the presence of erroneous instructions from human users. For the first time, we propose a novel benchmark dataset that introduces various types of instruction errors considering potential human causes. This benchmark provides valuable insight into the robustness of VLN systems in continuous environments. We observe a noticeable performance drop (up to -25%) in Success Rate when evaluating the state-of-the-art VLN-CE methods on our benchmark. Moreover, we formally define the task of Instruction Error Detection and Localization, and establish an evaluation protocol on top of our benchmark dataset. We also propose an effective method, based on a cross-modal transformer architecture, that achieves the best performance in error detection and localization, compared to baselines. Surprisingly, our proposed method has revealed errors in the validation set of the two commonly used datasets for VLN-CE, i.e., R2R-CE and RxR-CE, demonstrating the utility of our technique in other tasks.
- Create a virtual environment (tested with python 3.7, torch 1.9.1+cu111, torch-scatter 2.0.9+cu11) and install base dependencies.

      conda create --name r2r_ie_ce python=3.7.12 -c conda-forge
      conda activate r2r_ie_ce
- Download the Matterport3D scene meshes. download_mp.py must be obtained from the Matterport3D project webpage.

      # run with python 2.7
      python download_mp.py --task habitat -o data/scene_datasets/mp3d/
      # Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb

  Extract the archives so that each scene has the form data/scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes. The scene_datasets folder must sit inside data.
- Follow the Habitat Installation Guide to install habitat-sim and habitat-lab. We use version v0.1.7 in our experiments. In brief:
- Install habitat-sim for a machine with multiple GPUs or without an attached display (i.e. a cluster):

      # option 1 - faster
      wget https://anaconda.org/aihabitat/habitat-sim/0.1.7/download/linux-64/habitat-sim-0.1.7-py3.7_headless_linux_856d4b08c1a2632626bf0d205bf46471a99502b7.tar.bz2
      conda install --use-local habitat-sim-0.1.7-py3.7_headless_linux_856d4b08c1a2632626bf0d205bf46471a99502b7.tar.bz2

      # option 2 - slower
      conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
- Install our project dependencies:

      pip install --ignore-installed -r requirements.txt
- Clone habitat-lab from the GitHub repository and install it. The commands below install the core of Habitat Lab as well as habitat_baselines.

      git clone --branch v0.1.7 https://github.com/facebookresearch/habitat-lab.git
      cd habitat-lab
      python setup.py develop --all # install habitat and habitat_baselines
- Install the tested version of torch (torch==1.9.1+cu111) and other dependencies:

      pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
      pip install tensorboard==1.15.0 # TensorBoard logging requires TensorBoard version 1.15 or above
- Download the BEVBert weights ckpt.iter9600.pth [link] into the ckpt folder. This can also be done with gdown (install it with pip install gdown). This is the best BEVBert checkpoint; download it only if you want to train IEDL from scratch. Otherwise, you can skip this step and download IEDL instead.

      gdown --fuzzy [link]
- Download IEDL (TODO):

      gdown --fuzzy [link]
- Download the waypoint predictor check_cwp_bestdist_hfov90 [link] for CE (continuous environment) and place it in data/wp_pred:

      gdown --fuzzy [link]
- Download the task dataset (R2RIE-CE) from Google Drive and place it under data/datasets/:

      cd data/datasets
      gdown --fuzzy https://drive.google.com/file/d/1GbypzvkiQ-e8M2I77UU5YDIZXi1sHkC3/view?usp=sharing
      unzip R2RIE_CE_1_3_v1.zip; rm -rf R2RIE_CE_1_3_v1.zip
- Download gibson-2plus-resnet50.pth [link] and place it in a folder of your choice.

      wget [link]

  Then set the path of this .pth file in MODEL.DEPTH_ENCODER.ddppo_checkpoint in the eval and train scripts.
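Before training or evaluating, it can help to confirm that the downloads ended up where the steps above say they should. The following is a minimal sketch, not part of the repository: the function name check_setup and its parameters are hypothetical, and it assumes the waypoint predictor is stored in data/wp_pred under a name starting with check_cwp_bestdist_hfov90 and that the extracted dataset is any non-empty content under data/datasets.

```python
from pathlib import Path
from typing import List, Optional

EXPECTED_SCENES = 90  # scene count stated in the setup steps


def check_setup(repo_root: Path,
                ddppo_checkpoint: Optional[Path] = None,
                need_bevbert: bool = False) -> List[str]:
    """Return human-readable problems found in the data layout (empty if none)."""
    problems: List[str] = []

    # Matterport3D meshes: data/scene_datasets/mp3d/{scene}/{scene}.glb
    mp3d = repo_root / "data" / "scene_datasets" / "mp3d"
    scenes = [d for d in mp3d.iterdir() if d.is_dir()] if mp3d.is_dir() else []
    if len(scenes) != EXPECTED_SCENES:
        problems.append(
            f"expected {EXPECTED_SCENES} scenes in {mp3d}, found {len(scenes)}")
    for scene in sorted(scenes):
        if not (scene / f"{scene.name}.glb").is_file():
            problems.append(f"missing mesh: {scene / (scene.name + '.glb')}")

    # Waypoint predictor (assumed name prefix) in data/wp_pred.
    wp_dir = repo_root / "data" / "wp_pred"
    if not list(wp_dir.glob("check_cwp_bestdist_hfov90*")):
        problems.append("waypoint predictor not found in data/wp_pred")

    # Task dataset: only checks that data/datasets exists and is non-empty.
    datasets = repo_root / "data" / "datasets"
    if not datasets.is_dir() or not any(datasets.iterdir()):
        problems.append("data/datasets is missing or empty")

    # BEVBert weights are only needed to train IEDL from scratch.
    if need_bevbert and not (repo_root / "ckpt" / "ckpt.iter9600.pth").is_file():
        problems.append("ckpt/ckpt.iter9600.pth not found")

    # Depth encoder weights live in a folder of the user's choice.
    if ddppo_checkpoint is not None and not ddppo_checkpoint.is_file():
        problems.append(f"depth encoder checkpoint not found: {ddppo_checkpoint}")
    return problems
```

Calling check_setup(Path("."), need_bevbert=True) from the repository root and printing the returned list gives a quick overview of anything still missing; an empty list means every path checked here was found.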
For training:
Open run_R2RIE-CE/train.bash and choose a folder name for your checkpoints by setting the variable WANDB_RUN_NAME. Then copy the original BEVBert checkpoint (ckpt/ckpt.iter9600.pth) into that folder and run the following command:
CUDA_VISIBLE_DEVICES="0,1" bash run_R2RIE-CE/train.bash 2333
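The checkpoint copy that precedes the training command can also be scripted. This is a minimal sketch under stated assumptions: stage_bevbert_checkpoint is a hypothetical helper, and checkpoint_dir must be the folder that train.bash actually uses for the chosen WANDB_RUN_NAME, whose parent location is not specified here.

```python
import shutil
from pathlib import Path


def stage_bevbert_checkpoint(checkpoint_dir: Path,
                             source: Path = Path("ckpt/ckpt.iter9600.pth")) -> Path:
    """Copy the BEVBert checkpoint into checkpoint_dir unless it is already there."""
    if not source.is_file():
        raise FileNotFoundError(f"BEVBert checkpoint not found: {source}")
    # Create the checkpoint folder if it does not exist yet.
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    target = checkpoint_dir / source.name
    # Leave an existing file untouched so partially trained runs are not overwritten.
    if not target.exists():
        shutil.copy2(source, target)
    return target
```

The function returns the path of the staged file, so it can be printed or logged before launching run_R2RIE-CE/train.bash.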
For evaluation:
CUDA_VISIBLE_DEVICES="0,1" bash run_R2RIE-CE/eval.bash 2333
See the docs folder for documentation on how to use the dataset (changing sensors, updating the task definition, etc.).
Our implementation is inspired by BEVBert.
Thanks for open sourcing this great work!