We propose BevNav framework to solve these issues by three parts: (i) we introduce a novel Bird’s Eye View (BEV) scene graph (BevSG) that utilizes multi-view 2D information transformed into 3D under the supervision of 3D detection to encode scene layouts and geometric clues. It can distinguish multi-view semantically similar objects and make plans in this graph. (ii) we propose BEV-BLIP contrastive learning that aligns the BEV and language grounding inputs transferring constrain commonsense knowledge in pre-trained models without other training in the environments. (iii) we design BEV-based view search navigation policy, which encourages representations that encode the semantics, relationships, and positional information of objects.
-
Dependeces: We use earlier (
0.2.2
) versions of habitat-sim and habitat-lab. Other related depencese can be found inrequirements.txt
. -
Data (MatterPort3D): Please download the scene dataset and the episode dataset from habitat-lab/DATASETS.md. Then organize the files as follows:
3dNav/
data/
scene_datasets/
mp3d/
episode_datasets/
objectnav_mp3d_v1/
The implementation of BEV Detection is built on MMDetection3D v0.17.1. Please follow BEVFormer for installation.
The implementation of VN is built on the latest version of Matterport3D simulators:
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
Many thanks to the contributors for their great efforts.
The dataset is based on indoor RGB images from Matterport3D. Please fill and sign the Terms of Use agreement form and send it to [email protected] to request access to the dataset.
Note that we use the undistorted_color_images for BEV Detection. Camera parameters (word-to-pixel matrix) are from undistorted_camera_parameters. The 3D box annotations can be available in mp3dbev/data. For VLN, please follow VLN-DUET for more details, including processed annotations, features and pretrained models of REVERIE, R2R and R4R datasets.
Please follow the scripts to extract visual features for both undistorted_color_images (for BEV Detection) and matterport_skybox_images (for VLN, optional). Note that all the ViT features of undistorted_color_images should be used (not only the [CLS] token, about 130 GB). Please note this line since different version of timm models have different output:
b_fts = model.forward_features(images[k: k+args.batch_size])
cd mp3dbev/
# multi-gpu train
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=${PORT:id} ./tools/dist_train.sh ./projects/configs/bevformer/mp3dbev.py 4
# multi-gpu test
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=${PORT:id} ./tools/dist_test.sh ./projects/configs/bevformer/mp3dbev.py ./path/to/ckpts.pth 4
# inference for BEV features
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=${PORT:id} ./tools/dist_test.sh ./projects/configs/bevformer/getbev.py ./path/to/ckpts.pth 4
Please also see train and inference for the detailed usage of MMDetection3D.
For environment setup and dataset preparation, please follow:
For evaluation, please follow:
We provide scripts for quick training and evaluation. The parameters can be found in sh_train_mp3d.sh and sh_eval.sh, You can modify these parameters to customize them according to your specific requirements.
sh sh_train_mp3d.sh # training
sh sh_eval.sh # evaluating