This collection showcases various projects focused on Deep Reinforcement Learning techniques. The projects are organized in a matrix structure: [environment x algorithm], where environment represents the challenge to be tackled, and algorithm denotes the method employed to solve it. In certain instances, multiple algorithms are applied to the same environment. Each project is presented as a Jupyter notebook, complete with a comprehensive training log.
The collection encompasses the following environments:
AntBulletEnv, BipedalWalker, BipedalWalkerHardcore, CarRacing, CartPole, Crawler, HalfCheetahBulletEnv,
HopperBulletEnv, LunarLander, LunarLanderContinuous, Markov Decision 6x6, Minitaur, Minitaur with Duck,
MountainCar, MountainCarContinuous, Pong, Navigation, Reacher, Snake, Tennis, Walker2DBulletEnv.
Four environments (Navigation, Crawler, Reacher, Tennis) are solved in the framework of the
Udacity Deep Reinforcement Learning Nanodegree Program.
-
Monte-Carlo Methods
In Monte Carlo (MC) methods, we play episodes of the game through to completion, collecting the rewards along the way, and then work backward through the episode to compute the return for every state visited. Repeating this over many episodes and averaging the returns gives an estimate of each state's value. -
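Below is a minimal, self-contained sketch of first-visit MC prediction on a toy 1-D random walk; the environment, the random policy, and all constants are illustrative and are not taken from the notebooks.

```python
# First-visit Monte Carlo prediction on a toy 1-D random walk (illustrative only).
import random
from collections import defaultdict

N_STATES = 6          # states 0..5; 0 and 5 are terminal
GAMMA = 1.0

def play_episode():
    """Follow a random policy from the middle state; return [(state, reward), ...]."""
    state, trajectory = N_STATES // 2, []
    while state not in (0, N_STATES - 1):
        next_state = state + random.choice((-1, 1))
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        trajectory.append((state, reward))
        state = next_state
    return trajectory

returns, V = defaultdict(list), defaultdict(float)
for _ in range(5000):
    trajectory = play_episode()
    visited_states = [s for s, _ in trajectory]
    G = 0.0
    # Walk backward through the episode, accumulating the return G.
    for t in range(len(trajectory) - 1, -1, -1):
        state, reward = trajectory[t]
        G = reward + GAMMA * G
        if state not in visited_states[:t]:       # first visit of this state
            returns[state].append(G)
            V[state] = sum(returns[state]) / len(returns[state])

print({s: round(V[s], 2) for s in sorted(V)})
```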
Function Approximation and Neural Network
The Universal Approximation Theorem (UAT) states that a feed-forward neural network with a single hidden layer and a finite number of nodes can approximate any continuous function, provided mild assumptions about the activation function are met. -
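As a small illustration of this idea, the sketch below fits a single-hidden-layer network to a continuous function; the framework (PyTorch), layer width, activation, and training settings are illustrative choices.

```python
# Minimal sketch: a single-hidden-layer network approximating a continuous
# function (sin here), in the spirit of the UAT. Hyperparameters are illustrative.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(1, 64),   # one hidden layer with a finite number of units
    nn.Tanh(),          # a non-polynomial activation satisfies the UAT assumptions
    nn.Linear(64, 1),
)

x = torch.linspace(-3.14, 3.14, 256).unsqueeze(1)
y = torch.sin(x)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)

for step in range(2000):
    loss = nn.functional.mse_loss(net(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final MSE: {loss.item():.5f}")
```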
Policy-Based Methods, Hill-Climbing, Simulated Annealing
Random-restart hill-climbing is often surprisingly effective. Simulated annealing is a useful probabilistic technique because it can escape local extrema instead of mistaking them for global ones. -
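The sketch below illustrates simulated annealing on a toy 1-D multimodal function; the objective, noise scale, and cooling schedule are illustrative assumptions rather than anything used in the notebooks.

```python
# Simulated annealing on a 1-D multimodal objective (illustrative only).
import math
import random

def objective(x):
    # Global maximum near x = 0, plus local maxima that trap plain hill-climbing.
    return math.exp(-x * x) + 0.3 * math.cos(5 * x)

x, best_x = 3.0, 3.0
temperature = 1.0
for step in range(20000):
    candidate = x + random.gauss(0.0, 0.3)
    delta = objective(candidate) - objective(x)
    # Always accept improvements; accept worse moves with probability exp(delta / T),
    # which lets the search escape local maxima while the temperature is high.
    if delta > 0 or random.random() < math.exp(delta / temperature):
        x = candidate
        if objective(x) > objective(best_x):
            best_x = x
    temperature = max(1e-3, temperature * 0.9995)   # gradual cooling

print(f"best x = {best_x:.3f}, objective = {objective(best_x):.3f}")
```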
Policy-Gradient Methods, REINFORCE, PPO
Define a performance measure J(\theta) to maximize. Learn the policy parameters \theta through approximate gradient ascent.
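A minimal sketch of the REINFORCE update follows: gradient ascent on J(\theta) is implemented by descending the negated objective, the sum of log-probabilities of the taken actions weighted by their returns. The policy network and the trajectory data below are placeholder assumptions.

```python
# REINFORCE update step with placeholder trajectory data (illustrative only).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states = torch.randn(10, 4)              # stand-in for states collected in one episode
actions = torch.randint(0, 2, (10,))     # stand-in for the actions that were taken
returns = torch.linspace(10.0, 1.0, 10)  # stand-in discounted returns G_t

log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -(log_probs * returns).sum()      # minimizing this is ascent on J(theta)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```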
-
Actor-Critic Methods, A3C, A2C, DDPG, TD3, SAC
The key difference between A3C and A2C is the asynchronous aspect. A3C involves multiple independent agents (networks) with their own weights, interacting with different copies of the environment in parallel, thus exploring a larger part of the state-action space more quickly. -
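The sketch below shows a single A2C-style update (the synchronous counterpart of A3C): the critic regresses toward bootstrapped targets and the actor follows the advantage-weighted policy gradient. Network sizes, coefficients, and the batch of transitions are placeholder assumptions.

```python
# One A2C-style update on a placeholder batch of transitions (illustrative only).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # action logits
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # state value V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

gamma = 0.99
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8,))
rewards = torch.rand(8)
next_states = torch.randn(8, 4)
dones = torch.zeros(8)

values = critic(states).squeeze(-1)
with torch.no_grad():
    targets = rewards + gamma * (1 - dones) * critic(next_states).squeeze(-1)
advantages = (targets - values).detach()

dist = torch.distributions.Categorical(logits=actor(states))
actor_loss = -(dist.log_prob(actions) * advantages).mean()
critic_loss = nn.functional.mse_loss(values, targets)
loss = actor_loss + 0.5 * critic_loss - 0.01 * dist.entropy().mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```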
Forward-Looking Actor or FORK
Model-based reinforcement learning leverages the model in a sophisticated manner, often utilizing deterministic or stochastic optimal control theory to optimize the policy based on the model. FORK uses the system network only as a black box to predict future states, not as a mathematical model for optimizing control actions. This distinction allows any model-free Actor-Critic algorithm augmented with FORK to remain model-free.
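A rough sketch of the forward-looking idea under illustrative assumptions: a system network F(s, a) that predicts the next state is fit by regression on observed transitions and then used as a black box to look one step ahead in the actor update. Dimensions, network sizes, and the extra-term weighting are hypothetical, and details of the actual FORK objective (for example, how many predicted steps it looks ahead) are omitted.

```python
# Sketch of a FORK-style system network and a forward-looking actor term (illustrative only).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2

system = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, STATE_DIM))            # system network: (s, a) -> predicted s'
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh()) # deterministic policy pi(s)
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))                    # Q(s, a); its own training is omitted here

system_opt = torch.optim.Adam(system.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Placeholder batch of transitions (s, a, s'), standing in for a replay-buffer sample.
s = torch.randn(32, STATE_DIM)
a = torch.rand(32, ACTION_DIM) * 2 - 1
s_next = torch.randn(32, STATE_DIM)

# 1) Fit the system network by plain supervised regression on observed transitions.
system_loss = nn.functional.mse_loss(system(torch.cat([s, a], dim=1)), s_next)
system_opt.zero_grad()
system_loss.backward()
system_opt.step()

# 2) Actor update: the usual -Q(s, pi(s)) term plus a forward-looking term that
#    evaluates the policy in the state the system network predicts will follow.
#    The system network is used only as a black-box predictor; only actor weights
#    are updated in this step.
pi_s = actor(s)
predicted_next = system(torch.cat([s, pi_s], dim=1))
actor_loss = (-critic(torch.cat([s, pi_s], dim=1)).mean()
              - 0.5 * critic(torch.cat([predicted_next, actor(predicted_next)], dim=1)).mean())
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```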