Reinforcement Learning is one of the hottest topics right now, with interest surging after DeepMind published its article on training deep neural networks to play Atari games with great success. mlpack implements a complete end-to-end framework for Reinforcement Learning, featuring multiple environments, policies, and methods. Of course, custom environments and policies can be plugged into the existing framework with no runtime overhead.
mlpack implements typical benchmark environments (Acrobot, Mountain Car, etc.) as well as commonly used policies and replay methods, and supports asynchronous learning too. In addition, it can communicate with the OpenAI Gym toolkit for more environments.
mlpack implements a number of the most popular environments used for testing RL agents and algorithms. These include the Cart Pole, Acrobot, Mountain Car and their variations. Of course, as mentioned above, you can communicate with OpenAI Gym for other environments, like the Atari video games.
A key component of mlpack is its extensibility. It is a simple process to create your own custom environment, specific to your needs, and use it with mlpack's RL framework. Every environment implements a few specific methods and classes which are used by the agents while learning:
- `State`: The `State` class is a representation of the environment. For the `CartPole`, this would involve storing the position, velocity, angle, and angular velocity.
- `Action`: For discrete environments, `Action` is a class with an enum naming all the possible actions the agent can take in the environment. Continuing with the `CartPole` example, the enum would simply contain the two possible actions, `backward` and `forward`. For continuous environments, the `Action` class contains an array whose size depends on the action space.
- `Sample`: This method is perhaps the heart of the environment; it provides a reward to the agent depending on the state and the action taken, and it also updates the state based on that action.
Of course, your custom environment will most likely make use of a number of helper methods, depending on your application, such as the `Dsdt` method in the `Acrobot` environment, which is used by the `RK4` iterative method (another helper method) to estimate the next state.
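To make this concrete, below is a minimal sketch of a custom discrete environment following this interface. The environment itself (`LineWorld`), its dynamics, and its reward values are hypothetical, invented for illustration; the member names mirror those of mlpack's built-in environments, but check the headers under `mlpack/methods/reinforcement_learning/environment/` for the exact interface the agents expect.

```c++
#include <mlpack/core.hpp>

// A hypothetical one-dimensional environment: the agent walks along a line
// and is rewarded for reaching the goal position.
class LineWorld
{
 public:
  // State: the agent's position on the line.
  class State
  {
   public:
    State() : data(1, arma::fill::zeros) { }

    double Position() const { return data[0]; }
    double& Position() { return data[0]; }

    // The underlying representation, consumed by the learning agents.
    const arma::colvec& Encode() const { return data; }

    // Dimension of the state representation.
    static constexpr size_t dimension = 1;

   private:
    arma::colvec data;
  };

  // Action: two discrete moves, left or right.
  class Action
  {
   public:
    enum actions { left, right };
    Action::actions action;

    // Number of possible actions.
    static const size_t size = 2;
  };

  // Sample: update the state according to the action and return the reward.
  double Sample(const State& state, const Action& action, State& nextState)
  {
    nextState.Position() = state.Position() +
        ((action.action == Action::right) ? 1.0 : -1.0);
    return IsTerminal(nextState) ? 1.0 : -0.01;
  }

  // An episode ends when the goal position is reached.
  bool IsTerminal(const State& state) const
  { return state.Position() >= goal; }

  // Every episode starts at position 0.
  State InitialSample() { return State(); }

 private:
  static constexpr double goal = 10.0;
};
```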
A Reinforcement Learning agent, in general, takes actions in an environment in order to maximize a cumulative reward. To that end, it requires a way to choose actions (a policy) and a way to store and sample previous experiences (a replay).
An example of a simple policy is the epsilon-greedy policy. Using such a policy, the agent explores by choosing a random action with probability epsilon, and otherwise greedily chooses the action it currently estimates to be best. Epsilon is slowly decreased over time, shifting the balance from exploration toward exploitation.
Similarly, an example of a simple replay is a random replay. At each time step, the interaction between the agent and the environment is saved to a memory buffer, and previous experiences are sampled at random from the buffer to train the agent.
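mlpack provides this as the `RandomReplay` class. As a quick sketch, a buffer for the `CartPole` environment that returns batches of 10 samples and stores up to 10000 interactions could be created as follows (the constructor accepts further optional arguments documented in the source):

```c++
RandomReplay<CartPole> replayMethod(10, 10000);
```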
Instantiating the components of an agent is easily done by passing the environment as a template argument and the parameters of the policy or replay to the constructor. To create a `GreedyPolicy` and a `PrioritizedReplay` for the `CartPole` environment, we would do the following:
```c++
GreedyPolicy<CartPole> policy(1.0, 1000, 0.1);
PrioritizedReplay<CartPole> replayMethod(10, 10000, 0.6);
```
The arguments to `policy` are the initial epsilon value, the interval (in steps) over which it is decreased, and the value at which epsilon bottoms out and won't be reduced further. The arguments to `replayMethod` are the size of each sampled batch, the number of examples stored in memory, and the degree of prioritization.
In addition to the above components, an RL agent requires many hyperparameters to be tuned during its training period. These parameters include everything from the discount factor applied to future rewards to whether Double Q-Learning should be used or not. The `TrainingConfig` class can be instantiated and configured as follows:
```c++
TrainingConfig config;
config.StepSize() = 0.01;
config.Discount() = 0.9;
config.TargetNetworkSyncInterval() = 100;
config.ExplorationSteps() = 100;
config.DoubleQLearning() = false;
config.StepLimit() = 200;
```
The object `config` describes an RL agent using a step size of 0.01 for the optimization process, a discount factor of 0.9, and a target network sync interval of 100 steps. This agent only starts learning after storing 100 exploration steps, has a step limit of 200 per episode, and does not use Double Q-Learning.
In this way, we can easily configure an RL agent with the desired hyperparameters.
mlpack provides several powerful reinforcement learning agents that can be used to solve a wide range of tasks. The key agents available in mlpack are:
- Q-Learning
- Asynchronous Learning
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Soft Actor-Critic (SAC)
Each of these reinforcement learning agents is designed to be highly customizable and flexible, allowing users to easily apply them to various environments and tasks.
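To give a feel for how the pieces above fit together, here is a sketch of assembling a Q-Learning (DQN) agent for `CartPole` from the `policy`, `replayMethod`, and `config` objects defined earlier, and training it episode by episode. The network layer sizes, the convergence threshold, and the episode cap are arbitrary illustrative choices, and constructor signatures may differ slightly between mlpack versions, so consult the class documentation before relying on this:

```c++
// Set up a small feed-forward network to approximate the Q-function; the
// input is the 4-dimensional CartPole state, and the output is one Q-value
// per action.
FFN<MeanSquaredError, GaussianInitialization> network(MeanSquaredError(),
    GaussianInitialization(0, 0.001));
network.Add<Linear>(128);
network.Add<ReLU>();
network.Add<Linear>(2);

SimpleDQN<> model(network);

// Assemble the agent from the config, policy, and replay defined above.
QLearning<CartPole, decltype(model), ens::AdamUpdate, decltype(policy)>
    agent(config, model, policy, replayMethod);

// Train episode by episode, tracking the running average return.
arma::running_stat<double> averageReturn;
size_t episodes = 0;
while (episodes < 1000 && averageReturn.mean() < 50.0)
{
  averageReturn(agent.Episode());
  ++episodes;
}
```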
For further documentation on the reinforcement learning classes, consult the documentation in the source code, found in `mlpack/methods/reinforcement_learning/`.