DeepRL - Deep Reinforcement Learning in Python

This repository contains implementations of deep reinforcement learning algorithms on various environments.

Installation

To install the required packages, run the following commands in the terminal:

git clone https://github.com/olibridge01/DeepRL.git
cd DeepRL
pip install -r requirements.txt

Directory Structure

DeepRL/
├── agents/
│   └── (RL agents)
├── figs/
│   ├── gifs/
│   │   └── (gifs of agent behaviour)
│   └── plots/
│       └── (plots e.g. scores/episode durations)    
├── misc/
│   └── (miscellaneous resources e.g. fonts)
├── networks/
│   └── (neural networks used for deep RL algorithms)
├── saves/
│   └── (saved weights for trained agent networks)
├── utils/
│   └── (utility functions e.g. plotters, agent config etc.)
├── README.md
├── requirements.txt
└── run.py (script to train agents/plot gifs etc.)

Reinforcement Learning Algorithms

Below are some examples of the RL algorithms implemented.

  • Random agent: Selects actions uniformly at random from the action space. It does not learn, but it provides a baseline that shows how each environment behaves under random play.
  • Q-learning agent: Learns with an epsilon-greedy policy using a table of action value estimates - Watkins, 1989.
  • DQN agent: Learns with an epsilon-greedy policy using a feedforward neural network together with an experience replay buffer (a sketch of the buffer is given after this list) - Mnih et al, 2013.
  • DDQN agent: Learns with Double DQN, i.e. double Q-learning applied to DQN - van Hasselt et al, 2015.
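
As a rough illustration of the experience replay buffer used by the DQN/DDQN agents, the sketch below shows a minimal uniform-sampling buffer. The class name, field names and capacity are illustrative placeholders rather than the exact implementation in agents/.

import random
from collections import deque, namedtuple

# One interaction with the environment.
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))

class ReplayBuffer:
    """Fixed-size buffer of transitions, sampled uniformly at random for training."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)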

Below is a brief overview of the background theory for the algorithms used.

Background Theory - Action value functions, Q-Learning and DQN

The action value function $Q^\pi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ for a given policy $\pi$ is defined as the expected return for taking action $a$ in state $s$ and following policy $\pi$ thereafter:

$$ Q^\pi(s,a) = \mathbb{E} \left[ G_t | S_t = s, A_t = a , \pi \right] $$

where the return $G_t$ is the sum of discounted rewards $R$ from time $t$:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^\infty \gamma^k R_{t+k+1} $$

The action value function satisfies the Bellman equation:

$$ Q^\pi(s,a) = \mathbb{E} \left[ R_{t+1} + \gamma Q^\pi(S_{t+1},A_{t+1}) | S_t = s, A_t = a, \pi\right] $$

The optimal action value function $Q^*: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the maximum action value function over all possible policies; it satisfies the Bellman optimality equation:

$$ Q^* (s,a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} Q^* (S_{t+1},a') | S_t = s, A_t = a \right] $$

For a deterministic environment - where each state/action pair $(s,a)$ maps to exactly one new state $s'$ - the optimal action value function can be rewritten as:

$$ Q^* (s,a) = r(s,a) + \gamma \max_{a'} Q^* (s',a') $$

where $r(s,a)$ is the reward for taking action $a$ in state $s$.

In Q-Learning, the agent learns an estimate of the optimal action value function $Q\approx Q^*$ by performing an update at each timestep:

$$ Q(s_t,a_t) \gets Q(s_t,a_t) + \alpha \left[ r(s_t,a_t) + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t) \right] $$

where $\alpha$ is the learning rate. This method typically stores the action values in a lookup table, which requires any continuous state space to be discretised and is therefore only practical for small state/action spaces $\mathcal{S}\times\mathcal{A}$.
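
As a minimal sketch of this tabular update (assuming the continuous state has already been discretised to an integer index; the sizes and hyperparameters below are illustrative):

import numpy as np

n_states, n_actions = 100, 2          # assumed sizes after discretisation
alpha, gamma = 0.1, 0.99              # learning rate and discount factor
Q = np.zeros((n_states, n_actions))   # table of action value estimates

def q_learning_update(s, a, r, s_next):
    """One Q-learning step: move Q(s,a) towards the bootstrapped target."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])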

In Deep Q-Learning (DQN), the agent learns an estimate of the optimal action value function $Q\approx Q^*$ using a neural network. In this project, the torch.nn.SmoothL1Loss() (Huber) loss function was used to train the network.
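
A sketch of how the Huber loss can be applied to the temporal-difference error in DQN, with the Double DQN target shown in comments; policy_net, target_net and the tensor names are assumptions rather than the repository's exact API:

import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss()  # Huber loss

def dqn_loss(policy_net, target_net, states, actions, rewards, next_states, dones, gamma=0.99):
    """Huber loss between Q(s,a) and the bootstrapped target for a batch of transitions."""
    # Q-values of the actions actually taken.
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # DQN target: maximum target-network Q-value in the next state.
        next_q = target_net(next_states).max(1).values
        # Double DQN variant: select the action with the policy network,
        # evaluate it with the target network:
        #   next_actions = policy_net(next_states).argmax(1, keepdim=True)
        #   next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        targets = rewards + gamma * next_q * (1 - dones)
    return criterion(q_values, targets)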

In both cases, always choosing the action that maximises the current action value estimate is known as a greedy policy. This typically explores the state/action space poorly, so an epsilon-greedy policy is used instead: the agent chooses a random action with probability $\epsilon \in [0,1]$ and the greedy action with probability $1-\epsilon$. The value of $\epsilon$ is decayed over time so that the agent explores sufficiently in earlier episodes, then exploits the knowledge it has gained in later episodes.
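
A minimal sketch of epsilon-greedy action selection with an exponentially decayed epsilon (the schedule constants below are assumptions, not the values used in this repository):

import math
import random
import torch

eps_start, eps_end, eps_decay = 1.0, 0.05, 500  # assumed decay schedule

def select_action(policy_net, state, step, n_actions):
    """Choose a random action with probability eps, the greedy action otherwise."""
    eps = eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)
    if random.random() < eps:
        return random.randrange(n_actions)              # explore
    with torch.no_grad():
        return policy_net(state).argmax().item()        # exploit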

Results

The RL algorithms described above were run on a series of Gymnasium (formerly OpenAI Gym) environments, shown below. All results are shown alongside a random agent for comparison.

Direct Observations

In the following examples, the agent receives the physical properties of the system (e.g. position and velocity) as its state. This typically makes it easier for the agent to learn a good policy.
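
For reference, a minimal Gymnasium loop showing the direct observation vector returned by one of these environments, with a random agent used purely for illustration:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)
print(state.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity

done = False
while not done:
    action = env.action_space.sample()  # random agent
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()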

For the deep RL algorithms, a multi-layer perceptron (MLP) neural network was used to approximate the action value function $Q: \mathcal{S}\times\mathcal{A}\to\mathbb{R}$. The network architecture used for the CartPole-v1 environment is shown in the corresponding section below.

CartPole-v1

The aim is to keep the pole balanced by only moving the cart left or right. The agent receives a reward of +1 for every timestep the pole stays upright. An episode ends if the pole deviates by more than ±12° from the vertical, or the cart leaves the frame. An episode is a success if the agent keeps the pole upright for 500 timesteps.

The agent receives the following observations:

  • p: The position of the cart on the track.
  • v: The velocity of the cart.
  • a: The angle of the pole from the vertical.
  • av: The pole's angular velocity.

The neural network takes these observations as a 4-dimensional input, and outputs the action value function $Q(s,a')$ for each action $a'\in\mathcal{A}$ (in this case $\mathcal{A} = \{0,1\}$). The DQN/DDQN algorithms also make use of an experience replay buffer, which is described in Mnih et al, 2013.
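
A sketch of an MLP of this kind is shown below; the hidden layer sizes are assumptions, and the actual architecture lives in networks/.

import torch.nn as nn

class QNetwork(nn.Module):
    """MLP mapping a 4-dimensional CartPole observation to Q-values for the 2 actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, x):
        return self.net(x)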

A plot of the scores as a function of training episodes is shown below for DQN, DDQN and Q-learning. The agents were each run 6 times and the mean rolling score over the last 10 episodes is shown.
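
The rolling score can be computed with a simple moving average, e.g. (a sketch, not the plotting code in utils/):

import numpy as np

def rolling_mean(scores, window=10):
    """Mean score over the last `window` episodes at each point in training."""
    scores = np.asarray(scores, dtype=float)
    return np.convolve(scores, np.ones(window) / window, mode='valid')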

In general, DQN and DDQN were able to solve CartPole-v1 in slightly fewer episodes than Q-learning. Q-learning performs reasonably well here due to the relatively small state/action space, but will struggle to scale to larger environments.

LunarLander-v2

The aim is to land the lunar lander between the flags. The agent is rewarded based on how close the lander is to the landing pad, how slowly it is moving, and how little it uses its engines. It also receives a reward of -100 for crashing and +100 for landing safely. An episode terminates if the lander crashes, leaves the frame or comes to rest on the ground. An episode is a success if the agent's cumulative reward exceeds 200. GIFs of the agent's behaviour are shown below.

MountainCar-v0

The aim is to accelerate the car up the hill to the flag. The agent receives a reward of -1 per timestep, so an episode's return is $-t$ where $t$ is the number of timesteps taken to reach the flag (episodes are capped at 200 timesteps). An episode is a success if the car reaches the flag in under 200 timesteps. GIFs of the agent's behaviour are shown below.


Screen Observations

Here, rather than receiving physical properties of the current state (e.g. position and velocity), the agent receives only a screen image (a 2D array of coloured pixels) as the current state. To handle this, a few modifications were made to the DQN/DDQN algorithms described above. These are listed below and are described in more detail in Mnih et al, 2013, where the agent was trained to play Atari games.

  • The action value function was approximated with a convolutional neural network (CNN), followed by fully-connected layers, rather than an MLP. This allows the agent to learn features directly from the screen image (a diagram of the architecture is shown in the following section).
  • The screen image was preprocessed before being passed to the network: each frame was converted to a single grayscale channel and cropped to 84×84, then stacked with the previous 3 frames to give an input tensor of shape (4, 84, 84).
  • At each step, 4 frames were skipped and the chosen action was repeated for each of these frames (frame skipping).
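
A sketch of a convolutional Q-network for the stacked (4, 84, 84) input is given below. The layer sizes follow the widely used DQN convolutional architecture and are assumptions; the exact network used in this repository is defined in networks/.

import torch.nn as nn

class ConvQNetwork(nn.Module):
    """CNN mapping a stack of 4 grayscale 84x84 frames to Q-values."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature maps for an 84x84 input
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        # Pixel values assumed to be in [0, 255]; scale to [0, 1] before the conv layers.
        return self.head(self.conv(x / 255.0))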

CarRacing-v2

The aim is to achieve the highest score possible by racing a car around a racetrack. The car accumulates a reward of -0.1 per frame, and a reward of $+1000/N$ for every racetrack tile visited, where $N$ is the total number of tiles on the track.

A plot of the rolling scores (100-episode window) as a function of training episodes for the DQN algorithm is shown below. The score clearly improves when the agent learns how to drive the car round the track effectively.


Oli Bridge ([email protected])

Feel free to explore the different directories, and please reach out if you have any questions or suggestions. 🚀