---
layout: default
title: One-4-All
permalink: index.html
---
{% include_relative _relative_includes/mathjax.html %} {% include_relative _relative_includes/latex_macros.html %}
{% include_relative _relative_includes/authors.html %}
{% include_relative _relative_includes/badges.html %}
{% include_relative _relative_includes/main_video.html %}
Abstract. A fundamental task in robotics is to navigate between two locations. In particular, real-world navigation can require long-horizon planning using high-dimensional RGB images, which poses a substantial challenge for end-to-end learning-based approaches. Current semi-parametric methods instead achieve long-horizon navigation by combining learned modules with a topological memory of the environment, often represented as a graph over previously collected images. However, using these graphs in practice requires tuning a number of pruning heuristics to avoid spurious edges, limit runtime memory usage, and maintain reasonably fast graph queries in large environments. In this work, we present One-4-All (O4A), a method leveraging self-supervised and manifold learning to obtain a graph-free, end-to-end navigation pipeline in which the goal is specified as an image. Navigation is achieved by greedily minimizing a potential function defined continuously over image embeddings. Our system is trained offline on non-expert exploration sequences of RGB data and controls, and does not require any depth or pose measurements. We show that O4A can reach long-range goals in 8 simulated Gibson indoor environments, and that the resulting embeddings are topologically similar to ground-truth maps, even though no pose is ever observed. We further demonstrate successful real-world navigation using a Jackal UGV platform.
This page presents an overview of our method, along with additional videos, figures, and experiment details. Have a look at our paper and our IROS video for an in-depth presentation of the method!
{% include_relative _relative_includes/youtube_video.html %}
We consider a robot with a discrete action space $\actions$ tasked with reaching a goal specified as an RGB image, using only onboard RGB observations.
{% include_relative _relative_includes/main_diagram.html %}
O4A consists of 4 learnable deep networks, trained offline on previously collected trajectories of RGB observations and controls. During training, the data is organized into a connectivity graph $\graph$, which is only used to generate training targets and is discarded at deployment (a minimal code sketch of the four modules follows the list):
- The local backbone $\local$ (left) takes RGB images as input and produces low-dimensional latent embeddings $\code \in \latentspace$; it is trained with a self-supervised time-contrastive objective. Once trained, the local backbone provides a local metric $\norm{\code_t - \code_s}$ to measure similarity between observations. The extracted embeddings also serve as inputs to the other modules;
- The inverse kinematics head $\conn$ (center) uses pairs of embeddings to predict the action required to traverse from one embedding to the other (order matters), or the inability to do so through the $\mathtt{NOT\_CONNECTED}$ output;
- The forward kinematics head $\fd$ (bottom right) is trained using edges from $\graph$ to predict the next embedding $\code_j$ given the current embedding $\code_i$ and an action $a_{ij} \in \actions$;
- The geodesic regressor $\georeg$ (top right) learns to predict the shortest-path length between image embeddings. $\georeg$ is the core planning module and can be interpreted as encoding the geometry of $\graph$.
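To make the roles of the four modules concrete, here is a minimal PyTorch sketch. The architectures, the latent dimension, the action count, and all layer sizes are illustrative assumptions, not the configurations reported in the tables below.

```python
import torch
import torch.nn as nn

LATENT_DIM = 32    # assumed size of the latent space; illustrative only
NUM_ACTIONS = 4    # assumed discrete action count; illustrative only

class LocalBackbone(nn.Module):
    """Maps an RGB image to a low-dimensional embedding (trained with a
    self-supervised time-contrastive objective)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, LATENT_DIM),
        )

    def forward(self, rgb):          # rgb: (B, 3, H, W)
        return self.encoder(rgb)     # -> (B, LATENT_DIM)

class InverseKinematicsHead(nn.Module):
    """Predicts the action leading from z_i to z_j; the extra logit is
    the NOT_CONNECTED output."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, NUM_ACTIONS + 1),
        )

    def forward(self, z_i, z_j):     # order matters
        return self.mlp(torch.cat([z_i, z_j], dim=-1))

class ForwardKinematicsHead(nn.Module):
    """Predicts the next embedding z_j from z_i and a one-hot action."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_ACTIONS, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )

    def forward(self, z_i, action_onehot):
        return self.mlp(torch.cat([z_i, action_onehot], dim=-1))

class GeodesicRegressor(nn.Module):
    """Predicts the (non-negative) shortest-path length between two
    embeddings; this is the potential used for planning."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * LATENT_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Softplus(),
        )

    def forward(self, z, z_goal):
        return self.mlp(torch.cat([z, z_goal], dim=-1)).squeeze(-1)
```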
When multiple environments are considered, the local backbone, inverse kinematics head, and forward kinematics head are shared across all of them, so a single set of weights serves every environment.
The geodesic regressor, in contrast, is trained per environment, since it encodes environment-specific geometry. Conditioned on a goal embedding, it assigns to any embedding an estimate of the remaining path length, defining a potential function over the latent space. Navigation then reduces to greedy descent on this potential: at each step, the forward kinematics head imagines the successor embedding of every action, and the robot executes the action whose imagined successor has the lowest potential.
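A minimal sketch of one such control step, reusing the illustrative modules above (`select_action` is a hypothetical helper for exposition, not the released implementation):

```python
import torch

@torch.no_grad()
def select_action(z_t, z_goal, forward_kin, geodesic_reg, num_actions=4):
    """One step of greedy potential descent.

    z_t, z_goal: (1, LATENT_DIM) current and goal embeddings.
    Returns the index of the action whose imagined successor embedding
    has the lowest predicted geodesic distance to the goal.
    """
    actions = torch.eye(num_actions)                   # all one-hot actions
    z_next = forward_kin(z_t.expand(num_actions, -1), actions)
    potential = geodesic_reg(z_next, z_goal.expand(num_actions, -1))
    return int(potential.argmin())
```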
{% include_relative _relative_includes/potentials.html %}
Furthermore, we call these goal-conditioned distance estimates neural potentials; the figure above visualizes them for several goals.
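Since $\georeg$ regresses shortest-path lengths, its targets can be computed on the training graph $\graph$. A sketch of how such targets could be generated, assuming unit-weight edges and a NetworkX graph representation (both assumptions for illustration):

```python
import networkx as nx

def geodesic_targets(graph: nx.Graph, pairs):
    """Shortest-path lengths between node pairs of the training graph,
    used as regression targets for the geodesic regressor."""
    return [nx.shortest_path_length(graph, source=i, target=j)
            for i, j in pairs]

# Toy usage: a path graph 0-1-2-3-4.
g = nx.path_graph(5)
print(geodesic_targets(g, [(0, 4), (1, 3)]))  # [4, 2]
```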
We perform our simulation experiments using 8 scenes from the Gibson dataset, rendered with the Habitat simulator. Trajectories are categorized as easy (1.5-3 m), medium (3-5 m), hard (5-10 m), or very hard (over 10 m) according to the geodesic distance between the start and goal positions.
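A small helper making this categorization explicit, with thresholds simply restating the split above:

```python
def episode_difficulty(geodesic_dist_m: float) -> str:
    """Bucket an episode by start-goal geodesic distance in meters,
    following the easy/medium/hard/very-hard split described above."""
    assert geodesic_dist_m >= 1.5, "episodes start at 1.5 m"
    if geodesic_dist_m < 3.0:
        return "easy"
    if geodesic_dist_m < 5.0:
        return "medium"
    if geodesic_dist_m < 10.0:
        return "hard"
    return "very hard"
```

Example O4A rollouts for each scene and difficulty are shown below.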
{% include_relative _relative_includes/row_videos.html title="Aloha" src_1="img/aloha/Aloha_easy.mp4" src_2="img/aloha/Aloha_medium.mp4" src_3="img/aloha/Aloha_hard.mp4" src_4="img/aloha/Aloha_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Annawan" src_1="img/annawan/Annawan_easy.mp4" src_2="img/annawan/Annawan_medium.mp4" src_3="img/annawan/Annawan_hard.mp4" src_4="img/annawan/Annawan_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Cantwell" src_1="img/cantwell/Cantwell_easy.mp4" src_2="img/cantwell/Cantwell_medium.mp4" src_3="img/cantwell/Cantwell_hard.mp4" src_4="img/cantwell/Cantwell_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Dunmor" src_1="img/dunmor/Dunmor_easy.mp4" src_2="img/dunmor/Dunmor_medium.mp4" src_3="img/dunmor/Dunmor_hard.mp4" src_4="img/dunmor/Dunmor_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Eastville" src_1="img/eastville/Eastville_easy.mp4" src_2="img/eastville/Eastville_medium.mp4" src_3="img/eastville/Eastville_hard.mp4" src_4="img/eastville/Eastville_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Hambleton" src_1="img/hambleton/Hambleton_easy.mp4" src_2="img/hambleton/Hambleton_medium.mp4" src_3="img/hambleton/Hambleton_hard.mp4" src_4="img/hambleton/Hambleton_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Nicut" src_1="img/nicut/Nicut_easy.mp4" src_2="img/nicut/Nicut_medium.mp4" src_3="img/nicut/Nicut_hard.mp4" src_4="img/nicut/Nicut_very_hard.mp4" %}
{% include_relative _relative_includes/row_videos.html title="Sodaville" src_1="img/sodaville/Sodaville_easy.mp4" src_2="img/sodaville/Sodaville_medium.mp4" src_3="img/sodaville/Sodaville_hard.mp4" src_4="img/sodaville/Sodaville_very_hard.mp4" %}
{% include_relative _relative_includes/main_table.html %}
{% include_relative _relative_includes/jackal_table.html %}
{% include_relative _relative_includes/jackal_video.html %}
{% include_relative _relative_includes/graph_embedding_images_1.html %} {% include_relative _relative_includes/graph_embedding_images_2.html %}
Finally, we provide detailed architectures and hyperparameters for O4A and for the SPTM and ViNG baselines.
{% include_relative _relative_includes/architecture_o4a.html %}
{% include_relative _relative_includes/architecture_sptm.html %}
{% include_relative _relative_includes/architecture_ving.html %}
{% include_relative _relative_includes/hyperparam_table.html %}
{% include_relative _relative_includes/extended_results.html %}
{% include_relative _relative_includes/augmentations_images.html src_1="img/augmentations/original.png" caption_1="Original" src_2="img/augmentations/brightness_contrast.png" caption_2="Brightness/contrast" src_3="img/augmentations/dropout.png" caption_3="Dropout" src_4="img/augmentations/gauss_noise.png" caption_4="Gaussian noise" src_5="img/augmentations/hue_saturation.png" caption_5="Hue/saturation" %}
{% include_relative _relative_includes/augmentations_images.html src_1="img/augmentations/jitter.png" caption_1="Color jitter" src_2="img/augmentations/motion_blur.png" caption_2="Motion blur" src_3="img/augmentations/perspective.png" caption_3="Perspective change" src_4="img/augmentations/sharpening.png" caption_4="Sharpening" src_5="img/augmentations/shift_scale_rotate.png" caption_5="Shift-Scale-Rotate" %}
{% include_relative _relative_includes/citation.html %}