The paper Train Hard, Fight Easy: Robust Meta Reinforcement Learning introduces RoML - a meta-algorithm that takes any meta-RL baseline algorithm and generates a robust version of it. This repo implements RoML on top of the original implementation of VariBAD (we also provide implementations on top of PEARL and MAML). See below how to run RoML in your own favorite algorithmic framework in a few simple steps.
- Reinforcement Learning (RL) aims to learn a policy that makes decisions and maximizes the cumulative rewards (AKA returns) within a given environment.
- Meta-RL aims to learn a "meta-policy" that can adapt quickly to new environments (AKA tasks).
- Robust Meta RL (RoML) is a meta-algorithm that takes a meta-RL baseline algorithm, and generates a robust version of this baseline.
RoML optimizes the returns of the high-risk tasks instead of the average task. Specifically, it focuses on the Conditional Value at Risk (CVaR) of the returns over tasks, i.e., the average return over the worst-performing fraction of tasks.
During meta-training, RoML uses the Cross Entropy Method (CEM) to modify the selection of tasks, aiming to sample tasks whose expected return is among the worst.
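In symbols (notation ours; the paper's exact formulation may differ): let $R(\tau)$ be the expected return of the meta-policy on a task $\tau$ drawn from the task distribution $D$. Standard meta-RL maximizes $\mathbb{E}_{\tau\sim D}[R(\tau)]$, whereas the robust objective maximizes the tail average

$$\mathrm{CVaR}_\alpha(R) \;=\; \mathbb{E}_{\tau\sim D}\!\left[\,R(\tau)\;\middle|\;R(\tau)\le q_\alpha\,\right],$$

where $q_\alpha$ is the $\alpha$-quantile of $R(\tau)$ (for a continuous return distribution), i.e., the mean return over the worst $\alpha$-fraction of tasks.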
In the bridge environment of Khazad-Dum, VariBAD (left) attempts to take the short path through the bridge, but sometimes falls into the abyss. RoML (right) goes around and avoids the risk of falling.
To train the meta-policies, download this repo and run:
`python main.py --env-type ENV --seed 0 1 2 3 4`
- Replace `ENV` with the desired environment: `khazad_dum_varibad`, `cheetah_vel_varibad`, `cheetah_mass_varibad`, `cheetah_body_varibad` or `humanoid_mass_varibad`.
- The line above runs the baseline VariBAD algorithm. For RoML, add `--cem 1`. For CVaR-ML (defined in the paper), add `--tail 1` (without `--cem`). See the example commands below.
- To reproduce the full experiments of the paper, add seeds up to 29.
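For example, on the Khazad-Dum environment (combining the options listed above):

```
# Baseline VariBAD:
python main.py --env-type khazad_dum_varibad --seed 0 1 2 3 4

# RoML on top of VariBAD:
python main.py --env-type khazad_dum_varibad --seed 0 1 2 3 4 --cem 1

# CVaR-ML (without --cem):
python main.py --env-type khazad_dum_varibad --seed 0 1 2 3 4 --tail 1
```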
To process the results after training, use the module `analysis.py` as demonstrated in the notebooks in this repo (`.ipynb` files).
RoML can be easily implemented on top of any meta-RL baseline algorithm (not only VariBAD). To run RoML in your own algorithmic framework, just modify the task-selection process during meta-training:
- Create a CEM sampler before training (e.g., using the Dynamic CEM package).
- When choosing the tasks, use the CEM to do the sampling.
- After running the tasks, update the CEM with the resulting returns.
For example, search "cem" in the module `metalearner.py` in this repo.
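To make these three steps concrete, here is a minimal, self-contained sketch of such a training loop. It is illustrative only: it does not use the CEM package or this repo's API, the Gaussian task family and the update rule are simplified (no smoothing, reference distribution, or importance-sampling corrections), and `train_on_task` is a hypothetical stand-in for your baseline's meta-training step.

```python
import numpy as np

class GaussianCEMTaskSampler:
    """Toy CEM over a 1-D task parameter (e.g., a target velocity).

    Maintains a Gaussian sampling distribution and periodically re-fits it
    to the tasks that yielded the *lowest* returns, so that hard tasks are
    over-sampled during meta-training.
    """

    def __init__(self, mean=0.0, std=1.0, elite_ratio=0.2, batch_size=50):
        self.mean, self.std = mean, std
        self.elite_ratio = elite_ratio
        self.batch_size = batch_size
        self._tasks, self._returns = [], []

    def sample_task(self):
        # Step 2: draw the next training task from the current distribution.
        task = float(np.random.normal(self.mean, self.std))
        self._tasks.append(task)
        return task

    def update(self, ret):
        # Step 3: report the return obtained on the most recently sampled task.
        self._returns.append(ret)
        if len(self._returns) >= self.batch_size:
            order = np.argsort(self._returns)                 # worst returns first
            n_elite = max(2, int(self.elite_ratio * self.batch_size))
            elites = np.array(self._tasks)[order[:n_elite]]
            # Re-fit the sampling distribution to the hardest tasks.
            self.mean = float(elites.mean())
            self.std = float(elites.std()) + 1e-3             # keep some exploration
            self._tasks, self._returns = [], []


def meta_train(train_on_task, n_iters=2000):
    """train_on_task(task) -> return; stands in for your baseline's inner loop."""
    sampler = GaussianCEMTaskSampler()                        # step 1: create the sampler
    for _ in range(n_iters):
        task = sampler.sample_task()                          # step 2: CEM picks the task
        ret = train_on_task(task)                             # run the baseline's update
        sampler.update(ret)                                   # step 3: feed the return back


if __name__ == "__main__":
    # Dummy "baseline": tasks far from 2.0 are harder (lower return).
    meta_train(lambda task: -abs(task - 2.0) + 0.1 * np.random.randn())
```

The essential point is that the sampler, rather than a fixed task distribution, decides which task the baseline trains on next, and it is updated from the returns it observes.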
Important implementation notes:
- Only modify task sampling in training - not in testing.
- The CEM modifies the distribution from which the tasks are selected. For this, the user must define in advance a parametric family of distributions over which the CEM operates, as explained in the CEM package documentation. For example, if the tasks are defined within a bounded interval, we might use a Beta distribution; if the tasks are defined by positive numbers, we could use an exponential distribution. See examples in the module `cross_entropy_sampler.py` in this repo, as well as the toy snippet below.
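As a toy illustration of the last point (not this repo's code; the function names are made up), the Gaussian re-fit in the sketch above could be replaced by fitting whichever parametric family matches the task space to the hardest tasks of each batch:

```python
import numpy as np

def fit_beta(elite_tasks):
    """Method-of-moments Beta fit for tasks bounded in (0, 1).

    New tasks would then be drawn via np.random.beta(alpha, beta).
    """
    m, v = elite_tasks.mean(), elite_tasks.var() + 1e-8
    common = max(m * (1.0 - m) / v - 1.0, 1e-3)
    return m * common, (1.0 - m) * common      # (alpha, beta)

def fit_exponential(elite_tasks):
    """Maximum-likelihood exponential fit for strictly positive tasks.

    New tasks would then be drawn via np.random.exponential(scale).
    """
    return elite_tasks.mean()                  # scale parameter
```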