This is an implementation of Proximal Policy Optimization (PPO) written in PyTorch, using OpenAI gym environments.
The algorithm was trained on Pong-v0. The reward graphs show the moving average of the reward over the past 100 episodes.
For Pong, the reward metric was averaged over individual 1-point games rather than the entire 21-point game, which makes the minimum reward -1 and the maximum +1.
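As a concrete illustration of that metric, here is a minimal sketch of a moving average over the most recent 100 episode rewards. The helper name `running_avg_rewards` is hypothetical and not part of this repo:

```python
from collections import deque

def running_avg_rewards(episode_rewards, window=100):
    """Return the moving average of the most recent `window` episode rewards."""
    buf = deque(maxlen=window)  # automatically drops rewards older than `window`
    avgs = []
    for r in episode_rewards:
        buf.append(r)
        avgs.append(sum(buf) / len(buf))
    return avgs
```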
- python 3.5 or later
- pip
- gym
- gym-snake
- numpy
- matplotlib
- pytorch 0.4.0
You probably know how to clone a repo if you're getting into RL. But in case you don't, open a bash terminal and type the command:
$ git clone https://github.com/grantsrb/PyTorch-PPO
Then navigate to the top folder using
$ cd PyTorch-PPO
Hopefully you have already installed all the appropriate dependencies. See the Dependencies list above for what is required.
To run a session on gym's Pong-v0 use the command:
$ python main.py env_type=Pong-v0
This will run a training session with the hyperparameters listed in the `hyperparams.py` file.
After training your policy, you can watch it run in the environment using the `watch_model.py` script. To use this script, pass the name of the saved PyTorch state dict that you would like to watch. You can also specify the environment type and model type using `env_type=<name_of_gym_environment>` and `model_type=<model_type>`, respectively.
Here's an example:
$ python watch_model.py save_file=default_net.p env_type=Pong-v0 model_type=conv
The order of the command line arguments does not matter.
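Order-independent `key=value` parsing can be sketched as follows. This is a hypothetical simplification for illustration; see the repo for its actual argument handling. Note that values are left as strings here, whereas the repo presumably casts them to the appropriate types:

```python
import sys

def parse_args(argv):
    """Parse space-free key=value pairs; order does not matter."""
    hyps = {}
    for arg in argv:
        key, val = arg.split("=", 1)  # split on the first '=' only
        hyps[key] = val
    return hyps

if __name__ == "__main__":
    print(parse_args(sys.argv[1:]))
```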
Much of deep learning consists of tuning hyperparameters. It can be extremely addicting to change the hyperparameters by hand and then stare at the average reward as the algorithm trains. THIS IS A HOMERIC SIREN! DO NOT SUCCUMB TO THE PLEASURE! It is too easy to change hyperparameters before their results are fully known. It is difficult to keep track of what you did, and the time spent toying with hyperparameters could be spent reading papers, studying something useful, or calling your mom and telling her that you love her (you should do that more often, btw. And your dad, too.)
This repo contains an automated system under the name `hypersearch.py`. You can set the ranges of the hyperparameters you would like to search over manually, or use the function `make_hyper_range` located in the `hyperparams.py` file. See `hypersearch.py` for an example.
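The core idea of an automated search can be sketched as a grid search over hyperparameter ranges. This is an illustrative sketch, not the repo's `hypersearch.py` implementation; the names `hyper_search` and `train_fn` are hypothetical:

```python
from itertools import product

def hyper_search(ranges, train_fn):
    """Hypothetical grid search: try every combination of hyperparameter
    values in `ranges` and return the best-scoring setting.

    `ranges` maps hyperparameter names to lists of candidate values;
    `train_fn` runs a training session and returns a score to maximize.
    """
    keys = list(ranges)
    best_score, best_hyps = float("-inf"), None
    for combo in product(*(ranges[k] for k in keys)):
        hyps = dict(zip(keys, combo))
        score = train_fn(hyps)
        if score > best_score:
            best_score, best_hyps = score, hyps
    return best_hyps, best_score
```

In practice random search or successive halving often beats an exhaustive grid, but the bookkeeping (enumerate settings, train, keep the best) looks the same.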
RL algorithms often need to be tuned well for them to work. There are tons of hyperparameters that can have large impacts on the training of the algorithm. In order to help with automated hyperparameter tuning, this project offers a number of optional command line arguments. Each is set using `<argument_name>=<argument>` with no spaces. For example, if you wanted to set the variable `n_envs` (the number of parallel environments) to 15, then you would use the following:
$ python train_model.py n_envs=15
See `hyperparams.py` for the default values.
- `exp_name` - string of the name of the experiment. Determines the name that the PyTorch state dicts are saved to.
- `model_type` - denotes the model architecture to be used in training. Options include 'fc', 'conv', 'a3c'.
- `env_type` - string of the type of environment you would like to use PPO on. The environment must be an OpenAI gym environment.
- `optim_type` - denotes the type of optimizer to be used in training. Options: rmsprop, adam.
- `n_epochs` - PPO update epoch count.
- `batch_size` - PPO update batch size.
- `max_tsteps` - maximum number of time steps to collect over the course of training.
- `n_tsteps` - integer number of steps to perform in each environment per rollout.
- `n_envs` - integer number of parallel processes to instantiate and use for training.
- `n_frame_stack` - integer number of observations to stack to be used as the environment state.
- `n_rollouts` - integer number of rollouts to collect per gradient descent update. Whereas `n_envs` specifies the number of parallel processes, `n_rollouts` indicates how many rollouts should be performed in total among these processes.
- `n_past_rews` - number of past epochs to keep statistics from. Only affects logging and statistics printing; no effect on actual training.
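To illustrate what `n_frame_stack` does, here is a minimal sketch of stacking the most recent observations into a single state. The helper `make_frame_stacker` is hypothetical and not the repo's implementation:

```python
from collections import deque

def make_frame_stacker(n_frame_stack):
    """Return a function that stacks the most recent `n_frame_stack`
    observations into a list used as the environment state."""
    frames = deque(maxlen=n_frame_stack)
    def stack(obs):
        if not frames:
            # At the start of an episode, pad with copies of the first observation
            frames.extend([obs] * n_frame_stack)
        else:
            frames.append(obs)  # oldest frame is dropped automatically
        return list(frames)
    return stack
```

In the real setting each `obs` would be an image tensor and the stack would be concatenated along the channel dimension, but the bookkeeping is the same.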
- `lr` - learning rate.
- `lr_low` - if `decay_lr` is set to true, this value denotes the lower limit of the `lr` decay.
- `lambda_` - float value of the generalized advantage estimation moving-average factor. Only applies if using GAE.
- `gamma` - float value of the discount factor used to discount the rewards and advantages.
- `gamma_high` - if `incr_gamma` is set to true, this value denotes the upper limit of the `gamma` increase.
- `val_coef` - float value determining the weight of the value loss in the total loss calculation.
- `entr_coef` - float value determining the weight of the entropy in the total loss calculation.
- `entr_coef_low` - if `decay_entr` is set to true, this value denotes the lower limit of the `entr_coef` coefficient decay.
- `max_norm` - denotes the maximum gradient norm for gradient norm clipping.
- `epsilon` - PPO update clipping constant.
- `epsilon_low` - if `decay_eps` is set to true, this value denotes the lower limit of the `epsilon` decay.
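To clarify how `gamma` and `lambda_` interact, here is a minimal sketch of generalized advantage estimation over a single rollout. It is illustrative only (terminal-state masking is omitted, and the repo's implementation may differ):

```python
def gae(rewards, values, gamma=0.99, lambda_=0.95):
    """Generalized advantage estimation for one rollout.

    `values` holds the value estimate V(s_t) for each step, plus one
    bootstrap value for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponential moving average of TD errors, weighted by gamma * lambda_
        running = delta + gamma * lambda_ * running
        advantages[t] = running
    return advantages
```

With `lambda_ = 0` this reduces to the one-step TD error; with `lambda_ = 1` it reduces to the empirical discounted return minus the value baseline.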
- `resume` - boolean denoting whether training should be resumed from a previous point.
- `render` - boolean denoting whether the gym environment should be rendered.
- `clip_vals` - if set to true, uses the value clipping technique in the PPO update.
- `decay_eps` - if set to true, `epsilon` is linearly decreased from its initial value to the lower limit set by `epsilon_low` over the course of the entire run.
- `decay_lr` - if set to true, `lr` is linearly decreased from its initial value to the lower limit set by `lr_low` over the course of the entire run.
- `decay_entr` - if set to true, `entr_coef` is linearly decreased from its initial value to the lower limit set by `entr_coef_low` over the course of the entire run.
- `incr_gamma` - if set to true, `gamma` is linearly increased from its initial value to the upper limit set by `gamma_high` over the course of the entire run.
- `use_nstep_rets` - if set to true, uses the n-step returns method for the value loss as opposed to empirical discounted rewards.
- `norm_advs` - if set to true, normalizes advantages over the entire dataset. Takes precedence over `norm_batch_advs`.
- `norm_batch_advs` - if set to true, normalizes advantages within each training batch. Will not be performed if `norm_advs` is set to true.
- `use_bnorm` - uses batch normalization in the model if set to true.
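The schedules behind `decay_eps`, `decay_lr`, `decay_entr`, and `incr_gamma` all amount to linearly interpolating from an initial value to a limit over the run. A minimal sketch (the helper name `linear_decay` is hypothetical):

```python
def linear_decay(start, limit, t, total_tsteps):
    """Linearly move from `start` at t=0 to `limit` at t=total_tsteps.

    Works for decreases (epsilon, lr, entr_coef) and, since `limit` may
    exceed `start`, for increases as well (gamma toward gamma_high).
    """
    frac = min(t / total_tsteps, 1.0)  # clamp once the run length is reached
    return start + frac * (limit - start)
```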
- `grid_size` - integer denoting the square dimensions of the snake grid.
- `n_foods` - integer denoting the number of food pieces to appear on the grid.
- `unit_size` - integer denoting the number of pixels per unit in the grid.
In order to use a new environment, all that needs to be done is pass the name of the new environment using the `env_type=` command line argument. The environment must be an OpenAI gym environment.