(Demo video: dreamsmooth-small.mp4)
Reward models, which predict the rewards that an agent would have obtained for some imagined trajectory, play a vital role in state-of-the-art MBRL algorithms like DreamerV3 and TD-MPC because the policy learns from predicted rewards.
Reward prediction in sparse-reward environments, especially those with partial observability or stochastic rewards, is surprisingly challenging.
The following plots show predicted and ground truth rewards over a single episode, in several environments (including Robodesk, ShadowHand, and Crafter), with mispredicted sparse rewards highlighted in yellow.
We propose DreamSmooth, which performs temporal smoothing of the rewards obtained in each rollout before adding them to the replay buffer. Our method makes learning a reward model easier, especially when rewards are ambiguous or sparse.
With our method, the reward model no longer omits sparse rewards from its output, predicting them accurately.
Moreover, DreamSmooth's improved reward predictions translate to better task performance. We study several smoothing techniques (Gaussian, uniform, and exponential moving average) across many sparse-reward environments and find that our method outperforms the DreamerV3 baseline.
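Concretely, smoothing amounts to convolving each rollout's reward sequence with a normalized kernel. Below is a minimal sketch of the three variants, assuming a 1-D numpy array of per-step rewards from a single rollout; the repo's actual implementation lives in `embodied/core/smoothing.py`, and its boundary handling and exact EMA formulation may differ in detail.

```python
import numpy as np

def gaussian_smooth(rewards, sigma=3.0):
    # Truncated Gaussian kernel: each reward is spread over roughly 6*sigma steps.
    radius = int(3 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # unit mass, so the episode return is approximately preserved
    return np.convolve(rewards, kernel, mode="same")

def uniform_smooth(rewards, delta=5):
    # Spread each reward evenly over a window of 2*delta + 1 steps.
    kernel = np.ones(2 * delta + 1) / (2 * delta + 1)
    return np.convolve(rewards, kernel, mode="same")

def ema_smooth(rewards, alpha=0.3):
    # One simple forward-in-time exponential-moving-average variant.
    out = np.empty(len(rewards))
    acc = 0.0
    for t, r in enumerate(rewards):
        acc = alpha * r + (1.0 - alpha) * acc
        out[t] = acc
    return out

# A single sparse reward at step 50 of a 100-step rollout gets spread
# over its neighbors, giving the reward model a smoother target.
rewards = np.zeros(100)
rewards[50] = 1.0
smoothed = gaussian_smooth(rewards, sigma=3.0)
```

Because smoothing runs once per rollout, before the transitions enter the replay buffer, it adds negligible overhead on top of standard DreamerV3 training.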
This code is built on top of the official DreamerV3 implementation.
- Ubuntu 22.04
- Python 3.9+
- Install DreamerV3 dependencies
- Install this repository's dependencies:

  ```
  pip install -r requirements.txt
  ```
- Modified Robodesk and Hand environments can be found in `embodied/envs/robodesk.py` and `embodied/envs/hand.py`
- `embodied/core/smoothing.py`: reward smoothing implementation
- `embodied/agents/dreamerv3/configs.yaml`: configs
- `scripts/`: scripts for running experiments
Replace `[EXP_NAME]` with the name of the experiment, `[GPU]` with the GPU number you wish to use, and `[WANDB_ENTITY]` and `[WANDB_PROJECT]` with the W&B entity/project you want to log to. `[SMOOTHING_METHOD]` should be `gaussian`, `uniform`, `exp`, or `no` (for no smoothing), and `[SMOOTHING_PARAMETER]` is the parameter of the chosen method (e.g., sigma for Gaussian, delta for uniform).
- Running experiments on Robodesk:

  ```
  source scripts/d3_robodesk_train.sh [EXP_NAME] [GPU] [SEED] [SMOOTHING_METHOD] [SMOOTHING_PARAMETER] [WANDB_ENTITY] [WANDB_PROJECT]
  ```

- Running experiments on Hand:

  ```
  source scripts/d3_hand_train.sh [EXP_NAME] [GPU] [SEED] [SMOOTHING_METHOD] [SMOOTHING_PARAMETER] [WANDB_ENTITY] [WANDB_PROJECT]
  ```

- Running experiments on Crafter:

  ```
  source scripts/d3_crafter_train.sh [EXP_NAME] [GPU] [SEED] [SMOOTHING_METHOD] [SMOOTHING_PARAMETER] [WANDB_ENTITY] [WANDB_PROJECT]
  ```

- Running experiments on Atari:

  ```
  source scripts/d3_atari_train.sh [EXP_NAME] [TASK] [GPU] [SEED] [SMOOTHING_METHOD] [SMOOTHING_PARAMETER] [WANDB_ENTITY] [WANDB_PROJECT]
  ```

- Running experiments on DeepMind Control:

  ```
  source scripts/d3_dmc_train.sh [EXP_NAME] [TASK] [GPU] [SEED] [SMOOTHING_METHOD] [SMOOTHING_PARAMETER] [WANDB_ENTITY] [WANDB_PROJECT]
  ```
For example:

- Gaussian smoothing with sigma = 3 on Robodesk:

  ```
  source scripts/d3_robodesk_train.sh example_01 [GPU] 1 gaussian 3 [WANDB_ENTITY] [WANDB_PROJECT]
  ```

- Uniform smoothing with delta = 5 on Hand:

  ```
  source scripts/d3_hand_train.sh example_03 [GPU] 1 uniform 5 [WANDB_ENTITY] [WANDB_PROJECT]
  ```
```bibtex
@inproceedings{lee2024dreamsmooth,
  author    = {Vint Lee and Pieter Abbeel and Youngwoon Lee},
  title     = {DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing},
  booktitle = {The Twelfth International Conference on Learning Representations},
  year      = {2024},
  url       = {https://openreview.net/forum?id=GruDNzQ4ux}
}
```