We show examples of unconditional generations from the model in diverse scenes with different driving dynamics.

We show examples of ego-motion controllability. All videos are generated by GEM from the same starting frame but with different trajectory control inputs.
We observe that the model follows the control signals and generates realistic scenes.

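One way such trajectory conditioning can be implemented is to resample the ego trajectory to one waypoint per generated frame and embed each waypoint into a per-frame conditioning vector. The sketch below is an assumption for illustration, not GEM's actual conditioning code; the function name, Fourier-feature embedding, and dimensions are all hypothetical.

```python
import numpy as np

def encode_trajectory(waypoints, num_frames, embed_dim=64):
    """Hypothetical sketch: turn a 2D ego trajectory (x, y waypoints)
    into one control embedding per generated frame using random
    Fourier features, a common conditioning choice."""
    waypoints = np.asarray(waypoints, dtype=np.float64)  # (T, 2)
    # Resample the trajectory to exactly one waypoint per frame.
    t_src = np.linspace(0.0, 1.0, len(waypoints))
    t_dst = np.linspace(0.0, 1.0, num_frames)
    resampled = np.stack(
        [np.interp(t_dst, t_src, waypoints[:, i]) for i in range(2)], axis=1
    )  # (num_frames, 2)
    # Project (x, y) positions through fixed random frequencies.
    rng = np.random.default_rng(0)
    freqs = rng.normal(size=(2, embed_dim // 2))
    angles = resampled @ freqs  # (num_frames, embed_dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# One conditioning vector per frame for a 16-frame generation.
controls = encode_trajectory([(0.0, 0.0), (1.0, 0.5), (2.0, 2.0)], num_frames=16)
```

Different trajectories then yield different per-frame control vectors from the same starting frame, which is the setup the videos above demonstrate.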
GEM can move objects in the scene using DINO features.
In the following examples, we show an unconditional generation by GEM and the same generation with motion control.
The green box indicates the source DINO features and the blue box indicates the target position tokens used.
We observe that the object moves from the green box to the blue box.

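Conceptually, this control pairs the appearance features inside the source (green) box with position tokens for the target (blue) box. A minimal sketch of that pairing is below; the function name, box format, and feature-grid shape are assumptions for illustration, not GEM's actual interface.

```python
import numpy as np

def motion_control_tokens(dino_features, src_box, dst_box):
    """Hypothetical sketch: collect the DINO patch features inside the
    source (green) box and pair each with its shifted position inside
    the target (blue) box.

    dino_features: (H, W, C) grid of patch features.
    src_box / dst_box: (x0, y0, x1, y1) in patch coordinates.
    """
    x0, y0, x1, y1 = src_box
    tx0, ty0, _, _ = dst_box
    tokens = []
    for y in range(y0, y1):
        for x in range(x0, x1):
            feat = dino_features[y, x]                  # what to move
            target = (ty0 + (y - y0), tx0 + (x - x0))   # where to move it
            tokens.append((feat, target))
    return tokens

# Fake 16x16 grid of 384-dim patch features standing in for DINO output.
grid = np.random.default_rng(0).normal(size=(16, 16, 384))
toks = motion_control_tokens(grid, src_box=(2, 2, 4, 4), dst_box=(10, 10, 12, 12))
```

Each of the four source patches is paired with its corresponding target position, matching the green-to-blue movement shown in the videos.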
Unconditional Generation
Object Motion Control
Unconditional Generation
Insertion Control

In the following example, we insert a car on the left and control the motion of another car on the right.

GEM can use human poses to control the motion of pedestrians in the scene.
In these examples, the pedestrians cross the street or stop based on the human-pose controls.

We compare our long generations with those of the only other world model trained on OpenDV capable of generating long sequences.
We observe that our generations have higher ego-motion temporal consistency and more realistic dynamics.

GEM
Vista

We show interesting behaviors observed in the generated videos.
These behaviors do not necessarily exist in the ground-truth videos, but emerge from the model's learned dynamics.

Brake lights go off before moving
Smooth takeover dynamics on a long generation
GEM generates two modalities simultaneously: RGB and depth. We show examples of multimodal generations.
GEM is finetuned on two other ego-centric domains, and we observe that it quickly adapts to these new domains.
Some visualisations of the outputs of our pseudo-labeling pipeline.