Policy Gradient Algorithm

This repository is about policy gradient methods, including the simple (vanilla) policy gradient method and trust region policy optimization. Based on the CartPole-v0 environment from the OpenAI Gym module, the different methods are implemented using PyTorch. The hyperparameters are taken from each algorithm's corresponding paper.

Vanilla Policy Gradient Theory

Policy gradient methods directly approximate the policy function, mapping states to actions. Compared to value function approximation, they have the following advantages:

  • simpler parameterization, easier to converge
  • feasible when the action space is huge or infinite
  • a stochastic policy naturally incorporates exploration

Of course, it also has some shortcomings:

  • it can converge to a local optimum
  • evaluating a single policy has high variance

The objective is to maximize the expected reward. Using importance sampling, we can calculate the policy gradient. Normally, a stochastic policy can be expressed as

$$a = \mu_\theta(s) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

i.e., one fixed (deterministic) part plus a random part. Different function approximators can be used for the fixed part, and the random part can be a normal distribution. The loss function is as follows:

$$L(\theta) = -\,\mathbb{E}_{\pi_\theta}\big[\log \pi_\theta(a_t \mid s_t)\, R_t\big]$$
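As a rough PyTorch sketch of this loss (illustrative only; `GaussianPolicy` and the tensor names below are assumptions, not the repository's actual API):

```python
import torch
import torch.nn as nn

# Illustrative Gaussian policy: a small MLP outputs the action mean (the fixed part),
# and a learnable log standard deviation provides the random part.
class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(obs), self.log_std.exp())

def vanilla_pg_loss(policy, obs, actions, returns):
    """Negative of E[log pi(a|s) * R], so minimizing it ascends the objective."""
    log_prob = policy.dist(obs).log_prob(actions).sum(dim=-1)
    return -(log_prob * returns).mean()
```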

One training curve is shown below.

Trust Region Policy Optimization (TRPO)

The theory of TRPO can be found in the original paper. Here, I want to show the steps and explain the update method:

  • sample actions and trajectories
  • calculate the mean KL divergence and the Fisher-vector product (a sketch follows this list)
  • construct the surrogate loss and search along the conjugate gradient direction with a line search
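For the second step, a common trick is to compute the Fisher-vector product without ever forming the Fisher matrix, via a Hessian-vector product of the mean KL using double backpropagation. A minimal sketch, assuming `mean_kl` is the mean KL divergence computed from the policy network and `params` is its parameter list (names are illustrative, not the repository's exact code):

```python
import torch

def fisher_vector_product(mean_kl, params, vector, damping=0.1):
    """Return (F + damping * I) @ vector, where F is the Hessian of the mean KL
    with respect to the policy parameters (the Fisher matrix at the old policy)."""
    # First backward pass: gradient of the mean KL, keeping the graph for a second pass.
    grads = torch.autograd.grad(mean_kl, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    # Second backward pass: the gradient of (grad . vector) is the Hessian-vector product.
    grad_v = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_v, params)
    flat_hvp = torch.cat([h.reshape(-1) for h in hvp])
    return flat_hvp + damping * vector
```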

Update steps:

  • calculate the advantage, surrogate loss, and policy gradient
  • if there is no gradient, return
  • update θ along the conjugate gradient direction (sketched below), then try to update the value function and the policy network
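The direction used to update θ is usually obtained by solving F x = g with the conjugate gradient method, using only Fisher-vector products, and the full step is then scaled so the quadratic KL estimate stays within the trust region. A generic sketch under those assumptions (`fvp` is a function like the one above, `policy_grad` is the flattened policy gradient):

```python
import torch

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F @ v."""
    x = torch.zeros_like(g)
    r = g.clone()            # residual (x starts at zero, so r = g)
    p = g.clone()            # search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = r_dot / (p @ Fp)
        x = x + alpha * p
        r = r - alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

def full_step(fvp, policy_grad, max_kl=0.01):
    """Step direction x scaled so that 0.5 * x^T F x is approximately max_kl."""
    x = conjugate_gradient(fvp, policy_grad)
    step_size = torch.sqrt(2.0 * max_kl / (x @ fvp(x)))
    return step_size * x
```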

Line search:

The update uses a backtracking line search: the full step is shrunk by a constant factor until the actual improvement of the surrogate is a sufficient fraction of the expected improvement ei.
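A minimal sketch of this backtracking line search, assuming `f` evaluates the surrogate loss at flattened parameters `theta`, `fullstep` is the scaled step from conjugate gradient, and `expected_improve` is ei (all names illustrative):

```python
def linesearch(f, theta, fullstep, expected_improve, max_backtracks=10, accept_ratio=0.1):
    """Shrink the step until the actual improvement of the surrogate loss
    is a sufficient fraction of the expected improvement (ei)."""
    fval = f(theta)
    for i in range(max_backtracks):
        step_frac = 0.5 ** i
        theta_new = theta + step_frac * fullstep
        actual_improve = fval - f(theta_new)       # f is a loss, so lower is better
        expected = expected_improve * step_frac
        if expected > 0 and actual_improve / expected > accept_ratio:
            return True, theta_new                 # accept this step
    return False, theta                            # keep the old parameters
```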

One training curve on CartPole-v0 is shown below:

In my experience, TRPO is sensitive to the quality of the data. Based on the PPO paper and my experiments, TRPO can also perform very differently across trials because it uses a hard constraint on the KL divergence. PPO was proposed to improve on this.

Proximal Policy Optimization (PPO)

PPO is simpler and more stable than TRPO. First, define the probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

To modify the objective so that it penalizes changes to the policy that move the ratio away from 1, the paper proposes a clipped surrogate objective as well as fixed and adaptive KL penalty coefficients.

  • Clipped surrogate objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\hat{A}_t\big)\Big]$$

where $\epsilon$ is the clipping hyperparameter and $\hat{A}_t$ is the estimated advantage at timestep $t$.
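A minimal PyTorch sketch of the clipped surrogate loss, assuming the per-sample log probabilities under the new and old policies and the advantage estimates are already computed (names are illustrative):

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate objective L^CLIP, so minimizing it maximizes the objective."""
    ratio = torch.exp(log_prob_new - log_prob_old)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```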

  • Fixed KL and adaptive KL penalty coefficient

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\Big[r_t(\theta)\hat{A}_t - \beta\, \mathrm{KL}\big[\pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\big]\Big]$$

For the adaptive KL penalty, let $d$ be the mean KL divergence shown above; the coefficient $\beta$ is then updated as

$$\beta \leftarrow \begin{cases} \beta / 2 & \text{if } d < d_{\text{targ}} / 1.5 \\ \beta \times 2 & \text{if } d > d_{\text{targ}} \times 1.5 \end{cases}$$
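A sketch of this adaptive update, assuming `d` is the measured mean KL on the latest batch and `kl_target` is the target value:

```python
def update_kl_coef(beta, d, kl_target):
    """Adaptive KL penalty coefficient update: shrink beta when the policy
    moved too little, grow it when the policy moved too much."""
    if d < kl_target / 1.5:
        beta /= 2.0
    elif d > kl_target * 1.5:
        beta *= 2.0
    return beta
```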

The following graphs show the learning rewards.

Actor-Critic and Deep Deterministic Policy Gradient

DDPG is designed especially for continuous control with deep reinforcement learning. The main goal is to learn a deterministic policy from an exploratory behavior policy. In my experience, it does not converge faster than PPO, and it is sometimes sensitive to the quality of the data.
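A minimal sketch of one DDPG update, assuming standard actor/critic networks, their target copies, and a replay batch are already available (all names and hyperparameters here are illustrative, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update on a replay batch of (obs, act, rew, next_obs, done) tensors."""
    obs, act, rew, next_obs, done = batch

    # Critic: regress Q(s, a) toward the bootstrapped target.
    with torch.no_grad():
        target_q = rew + gamma * (1 - done) * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, act), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient, i.e. maximize Q(s, pi(s)).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks toward the online networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```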

Learning curve:

To train the model better, a data filter can be used to remove dirty data and obtain a better model.

Reference link:
