This repository contains:
- the implementation of navigation agents for our paper: Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation;
- a dataset for pre-training on the outdoor VLN task.
In this project, we use the Touchdown dataset and the StreetLearn dataset. More details regarding these two datasets can be found here.
Our pre-training dataset is built upon StreetLearn.
The guiding instructions for the outdoor VLN task are provided in touchdown/datasets/.
To download the panoramas, please refer to Touchdown Dataset and StreetLearn Dataset.
- Python 3.6
- PyTorch 1.7.0
- Texar
We conducted our experiments on Ubuntu 18.04 with Titan RTX GPUs.
Please run the following lines to download the code and install Texar:
git clone https://github.com/VegB/VLN-Transformer/
cd VLN-Transformer/
pip install [--user] -e . # install Texar
cd touchdown/
Training can be performed with the following command:
python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --exp_name [EXP_NAME]
- DATASET is the dataset for outdoor navigation. This repo currently supports the following three datasets:
  - touchdown is a dataset for outdoor VLN; the instructions are written by human annotators;
  - manh50 is a subset of StreetLearn; the instructions are generated by the Google Maps API;
  - manh50_mask has the same trajectories as manh50, but the instructions are style-modified (which is what we do in this paper).
- IMG_DIR contains the encoded panoramas for DATASET. After you get access to the panoramas, please encode them accordingly. Each file in this directory should be a numpy file [PANO_ID].npy that represents the panorama with the corresponding pano_id. The encoding process is described in the Touchdown paper, Section D.1.
- MODEL is the navigation agent; it may be rconcat for RCONCAT or vlntrans for VLN Transformer.
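Before training, it can help to confirm that IMG_DIR really holds one [PANO_ID].npy file per panorama. The following is a minimal sketch of such a check, not part of this repository: the helper name missing_pano_features is hypothetical, and the list of pano_ids is assumed to come from wherever you keep the panorama ids for your dataset.

```python
from pathlib import Path
from typing import Iterable, List


def missing_pano_features(img_dir: str, pano_ids: Iterable[str]) -> List[str]:
    """Return the pano_ids that have no [PANO_ID].npy file in img_dir.

    Hypothetical helper: it only checks that the files exist, not that
    their contents match the encoding described in the Touchdown paper.
    """
    root = Path(img_dir)
    return [pid for pid in pano_ids if not (root / f"{pid}.npy").is_file()]
```

An empty return value means every requested panorama has a feature file; otherwise the returned ids are the ones that still need to be encoded.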
More parameters and usage are listed here.
Note that vlntrans uses BERT (bert-base-uncased) to encode the instructions, which takes a lot of memory, so you may need to adjust the batch size to fit the model on your GPU.
In our experiments, we use three Titan RTX GPUs and a batch size of 30.
This is the command we use to pretrain VLN Transformer on our instruction-style-modified dataset:
CUDA_VISIBLE_DEVICES="0,1,2" python main.py --dataset 'manh50_mask' --img_feat_dir '/data/manh50_features_mean/' --model 'vlntrans' --batch_size 30 --max_num_epochs 15 --exp_name 'pretrain_mask'
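If you prefer to launch training from a Python script, for example to sweep batch sizes until the model fits on your GPUs, the same call can be assembled programmatically. This is a sketch under stated assumptions: build_train_command and gpu_env are hypothetical helpers, not part of this repo, and they use only the flags shown in the command above.

```python
import os
from typing import Dict, List, Optional, Sequence


def build_train_command(
    dataset: str,
    img_feat_dir: str,
    model: str,
    exp_name: str,
    batch_size: Optional[int] = None,
    max_num_epochs: Optional[int] = None,
) -> List[str]:
    """Assemble the main.py training call as an argument list (hypothetical helper)."""
    cmd = [
        "python", "main.py",
        "--dataset", dataset,
        "--img_feat_dir", img_feat_dir,
        "--model", model,
    ]
    # Optional flags are added only when a value is given.
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    if max_num_epochs is not None:
        cmd += ["--max_num_epochs", str(max_num_epochs)]
    cmd += ["--exp_name", exp_name]
    return cmd


def gpu_env(gpu_ids: Sequence[int]) -> Dict[str, str]:
    """Copy the current environment and restrict the visible GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


# Mirrors the pre-training command above.
cmd = build_train_command(
    dataset="manh50_mask",
    img_feat_dir="/data/manh50_features_mean/",
    model="vlntrans",
    exp_name="pretrain_mask",
    batch_size=30,
    max_num_epochs=15,
)
# To launch from touchdown/:
#   import subprocess
#   subprocess.run(cmd, env=gpu_env([0, 1, 2]), check=True)
```

Passing the arguments as a list avoids shell quoting issues with paths, and keeping the GPU selection in the environment matches how the command line above does it.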
The VLN agent can be fine-tuned starting from a pre-trained model:
python main.py --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION]
- PRETRAINED_MODEL specifies the pre-trained model;
- RESUME_OPTION specifies the checkpoint to load:
  - latest: the most recent checkpoint;
  - TC_best: the checkpoint with the best TC score on the dev set;
  - SPD_best: the checkpoint with the best SPD score on the dev set.
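Because RESUME_OPTION accepts only the three values listed above, a wrapper script can reject typos before any model is loaded. The sketch below is illustrative only: build_finetune_command and RESUME_OPTIONS are hypothetical names, and how main.py itself reacts to an unknown value is not something this example assumes.

```python
from typing import List

# The three checkpoint selectors documented for --resume.
RESUME_OPTIONS = ("latest", "TC_best", "SPD_best")


def build_finetune_command(
    dataset: str,
    img_feat_dir: str,
    model: str,
    pretrained_model: str,
    resume_option: str,
) -> List[str]:
    """Assemble the fine-tuning call, validating --resume first (hypothetical helper)."""
    if resume_option not in RESUME_OPTIONS:
        raise ValueError(
            f"resume_option must be one of {RESUME_OPTIONS}, got {resume_option!r}"
        )
    return [
        "python", "main.py",
        "--dataset", dataset,
        "--img_feat_dir", img_feat_dir,
        "--model", model,
        "--resume_from", pretrained_model,
        "--resume", resume_option,
    ]
```

For instance, build_finetune_command("touchdown", "[IMG_DIR]", "vlntrans", "[PRETRAINED_MODEL]", "TC_best") yields the argument list for fine-tuning from the checkpoint with the best TC score, with the bracketed placeholders to be filled in as described above.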
We can evaluate the agent's navigation performance on the test set and dev set with the following command:
python main.py --test True --dataset [DATASET] --img_feat_dir [IMG_DIR] --model [MODEL] --resume_from [PRETRAINED_MODEL] --resume [RESUME_OPTION] --CLS [True/False] --DTW [True/False]
The pre-trained models for VLN Transformer, RCONCAT and GA can be downloaded from here. Please place them in checkpoints/.
To reproduce the results in our paper, please use the following commands:
CUDA_VISIBLE_DEVICES="0" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'rconcat' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="1" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'ga' --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
CUDA_VISIBLE_DEVICES="2" python main.py --test True --dataset 'touchdown' --img_feat_dir [IMG_DIR] --model 'vlntrans' --batch_size 30 --resume_from [PRETRAINED_MODEL] --resume 'TC_best' --CLS True --DTW True
PRETRAINED_MODEL specifies the pre-trained model:
- vanilla: navigation agent trained on the touchdown dataset without pre-training on auxiliary datasets.
- finetuned_manh50: pre-trained on the manh50 dataset and fine-tuned on the touchdown dataset.
- finetuned_mask: pre-trained on the manh50_mask dataset and fine-tuned on the touchdown dataset.
@misc{zhu2020multimodal,
title={Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation},
author={Wanrong Zhu and Xin Wang and Tsu-Jui Fu and An Yan and Pradyumna Narayana and Kazoo Sone and Sugato Basu and William Yang Wang},
year={2020},
eprint={2007.00229},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The code and data could not have been built without streetlearn, speaker_follower, touchdown, and Texar. We also thank @Jiannan Xiang for his contribution to reproducing the Touchdown task.