This repository contains the code used in the experiments of our paper "Feature Control as Intrinsic Motivation for Hierarchical Reinforcement Learning" by Nat Dilokthanakul, Christos Kaplanis, Nick Pawlowski, and Murray Shanahan (https://arxiv.org/abs/1705.06769).
The code is adapted from an open-source implementation of A3C, "universe-starter-agent" (https://github.com/openai/universe-starter-agent).
We later found that a flat A3C agent trained with a reward shaped according to Equation 3 can perform as well as our feature-control agent on Montezuma's Revenge. This result supports the claim that additional auxiliary rewards or loss signals can be beneficial in sparse-reward environments, even though such rewards can skew the definition of the task.
Importantly, this raises a question about the benefit of the hierarchical elements proposed in the paper: it appears that the decisions made by the meta-controller do not contribute significantly to the success of the feature-control agent.
- Python 2.7
- six (for py2/3 compatibility)
- TensorFlow
- tmux (the start script opens up a tmux session with multiple windows)
- htop (shown in one of the tmux windows)
- gym
- gym[atari]
- universe
- opencv-python
- numpy
- scipy
conda create --name env python=2.7
source activate env
brew install tmux htop cmake # On Linux use sudo apt-get install -y tmux htop cmake
pip install "gym[atari]"
pip install universe
pip install six
pip install tensorflow
conda install -y -c https://conda.binstar.org/menpo opencv3
conda install -y numpy
conda install -y scipy
Add the following to your .bashrc so that you'll have the correct environment when the train.py script spawns new bash shells:
source activate env
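For example, one way to append that line (the echo command below is just an illustration; you can also edit .bashrc by hand) is:

```
echo "source activate env" >> ~/.bashrc
```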
The problem of sparse rewards is one of the hardest challenges in contemporary reinforcement learning. Hierarchical reinforcement learning (HRL) tackles this problem by using a set of temporally-extended actions, or options, each of which has its own subgoal. These subgoals are normally handcrafted for specific tasks. Here, though, we introduce a generic class of subgoals with broad applicability in the visual domain. Underlying our approach (in common with work using "auxiliary tasks") is the hypothesis that the ability to control aspects of the environment is an inherently useful skill to have. We incorporate such subgoals in an end-to-end hierarchical reinforcement learning system and test two variants of our algorithm on a number of games from the Atari suite. We highlight the advantage of our approach in one of the hardest games -- Montezuma's Revenge -- for which the ability to handle sparse rewards is key. Our agent learns several times faster than the current state-of-the-art HRL agent in this game, reaching a similar level of performance.
In this experiment, we evaluated the performance of the pixel-control agent (top row) and the feature-control agent (bottom row). To reproduce the results, check out the branch pixel_control for the pixel-control agent or feature_control for the feature-control agent. For example, use the following command for Montezuma's Revenge:
python train.py -w 8 -e MontezumaRevenge-v0 -l ~/experiments/montezuma_experiment
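Putting the two steps together, a full run for the pixel-control agent might look like the following (a sketch; the log directory name is arbitrary):

```
# select the pixel-control agent, then launch 8 asynchronous workers in a tmux session
git checkout pixel_control
python train.py -w 8 -e MontezumaRevenge-v0 -l ~/experiments/montezuma_pixel_control
```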
To change the value of beta, edit line 136 of a3c.py to the desired value. For example, for beta = 0.75:
self.beta = 0.75
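If you want to script a sweep over beta instead of editing the file by hand, one option (a sketch, assuming GNU sed and that the assignment appears exactly once in a3c.py in the form shown above) is to patch the line before launching training:

```
# hypothetical helper: overwrite the beta assignment in a3c.py, then train
sed -i 's/self.beta = .*/self.beta = 0.75/' a3c.py
python train.py -w 8 -e MontezumaRevenge-v0 -l ~/experiments/montezuma_beta075
```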
In this experiment, we improve performance by changing the BPTT length from 20 to 100. To run experiments with BPTT = 100, check out the branch feature_control_bptt100. For the baseline agent, check out the branch baseline. All experiments used 8 asynchronous workers.
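For example, reusing the Montezuma's Revenge command from above (the log directory names are arbitrary):

```
# feature-control agent with BPTT = 100, 8 asynchronous workers
git checkout feature_control_bptt100
python train.py -w 8 -e MontezumaRevenge-v0 -l ~/experiments/montezuma_bptt100

# baseline A3C agent for comparison
git checkout baseline
python train.py -w 8 -e MontezumaRevenge-v0 -l ~/experiments/montezuma_baseline
```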
In this experiment, we further improve the stability of training by using a target network, similar to DQN, to calculate the intrinsic reward of the feature-control agent. To run the experiment, check out the branch target and launch training as above.
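The exact definition of the intrinsic reward is given in the paper; the snippet below is only a minimal, self-contained sketch of the target-network mechanic, assuming a generic feature extractor and an intrinsic reward proportional to the change in target-network features between consecutive frames. All names, shapes, and the sync period are hypothetical and do not correspond to the repository's code.

```python
import numpy as np

def features(obs, params):
    # Stand-in for the conv-layer features used by the feature-control agent.
    return np.maximum(0.0, obs.dot(params["w"]) + params["b"])

def intrinsic_reward(prev_obs, obs, target_params):
    # Reward changes in the *target-network* features, so the reward signal
    # stays fixed between syncs instead of drifting with every gradient step.
    return np.abs(features(obs, target_params) - features(prev_obs, target_params)).mean()

rng = np.random.RandomState(0)
online_params = {"w": 0.1 * rng.randn(64, 32), "b": np.zeros(32)}
target_params = {k: v.copy() for k, v in online_params.items()}

SYNC_EVERY = 1000  # hypothetical sync period, in the spirit of DQN target networks

prev_obs = rng.randn(64)
for step in range(1, 3001):
    obs = rng.randn(64)                       # stand-in for a preprocessed frame
    r_int = intrinsic_reward(prev_obs, obs, target_params)
    # ... the A3C gradient step would update online_params here ...
    if step % SYNC_EVERY == 0:                # periodically copy online -> target
        target_params = {k: v.copy() for k, v in online_params.items()}
    prev_obs = obs
```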