Model-Free Prediction & Control with Temporal Difference (TD) and Q-Learning

TD-Learning is a combination of Monte Carlo and Dynamic Programming ideas. Like Monte Carlo, TD works based on samples and doesn't require a model of the environment. Like Dynamic Programming, TD uses bootstrapping to make updates.
Whether MC or TD is better depends on the problem, and no theoretical results prove a clear winner.
General Update Rule: Q[s,a] += learning_rate * (td_target - Q[s,a]). td_target - Q[s,a] is also called the TD Error.
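The update rule above can be sketched in a few lines of Python. This is only an illustration: the table layout (a `defaultdict` keyed by `(state, action)`), the helper name `td_update`, and the state and action labels are assumptions, not part of the exercises.

```python
from collections import defaultdict

# Assumed layout: Q[(state, action)] -> estimated value, 0.0 for unseen pairs.
Q = defaultdict(float)

def td_update(Q, state, action, td_target, learning_rate):
    """Move Q[state, action] a fraction of the way toward td_target."""
    td_error = td_target - Q[state, action]
    Q[state, action] += learning_rate * td_error
    return td_error

# Hypothetical state/action labels, purely for illustration.
first_error = td_update(Q, "s0", "right", td_target=1.0, learning_rate=0.5)
second_error = td_update(Q, "s0", "right", td_target=1.0, learning_rate=0.5)
print(first_error, second_error, Q["s0", "right"])  # 1.0 0.5 0.75
```

With a fixed target, each update halves the remaining TD Error here because the learning rate is 0.5.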
SARSA: On-Policy TD Control
TD Target for SARSA: R[t+1] + discount_factor * Q[next_state][next_action]
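A minimal sketch of one SARSA update, assuming a tabular Q stored as a dict of per-state value lists. The helper name `sarsa_update`, the `done` flag (which treats the value of a terminal state as zero), and the example states are assumptions made for illustration.

```python
from collections import defaultdict

N_ACTIONS = 2  # hypothetical number of actions
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def sarsa_update(Q, state, action, reward, next_state, next_action,
                 learning_rate=0.5, discount_factor=1.0, done=False):
    # On-policy: bootstrap from the action the agent actually takes next.
    bootstrap = 0.0 if done else Q[next_state][next_action]
    td_target = reward + discount_factor * bootstrap
    Q[state][action] += learning_rate * (td_target - Q[state][action])
    return td_target

Q["s1"] = [1.0, 3.0]
sarsa_update(Q, "s0", 0, reward=-1.0, next_state="s1", next_action=0,
             discount_factor=0.5)
print(Q["s0"][0])  # -0.25
```

The target uses `Q["s1"][0]` because action 0 is the one chosen in the next state, even though action 1 currently has the higher estimate.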
Q-Learning: Off-policy TD Control
TD Target for Q-Learning: R[t+1] + discount_factor * max(Q[next_state])
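A minimal sketch of one Q-Learning update, assuming a tabular Q stored as a dict of per-state value lists. The helper name `q_learning_update`, the `done` flag (which treats the value of a terminal state as zero), and the example states are assumptions made for illustration.

```python
from collections import defaultdict

N_ACTIONS = 2  # hypothetical number of actions
Q = defaultdict(lambda: [0.0] * N_ACTIONS)

def q_learning_update(Q, state, action, reward, next_state,
                      learning_rate=0.5, discount_factor=1.0, done=False):
    # Off-policy: bootstrap from the greedy value in the next state,
    # regardless of which action the behaviour policy takes there.
    bootstrap = 0.0 if done else max(Q[next_state])
    td_target = reward + discount_factor * bootstrap
    Q[state][action] += learning_rate * (td_target - Q[state][action])
    return td_target

Q["s1"] = [1.0, 3.0]
q_learning_update(Q, "s0", 0, reward=-1.0, next_state="s1",
                  discount_factor=0.5)
print(Q["s0"][0])  # 0.25
```

No next action is passed in: the target depends only on the largest estimate in `Q["s1"]`.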
Q-Learning has a positive bias because it uses the maximum of estimated Q values to estimate the maximum action value, all from the same experience. Double Q-Learning gets around this by splitting the experience and using different Q functions for maximization and estimation.
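One way to sketch the Double Q-Learning update is shown below, assuming two tabular Q functions and a coin flip that decides which one is updated on each step. The names `double_q_update`, `Q1`, `Q2`, and the `rng` parameter are assumptions for illustration.

```python
import random
from collections import defaultdict

N_ACTIONS = 2  # hypothetical number of actions
Q1 = defaultdict(lambda: [0.0] * N_ACTIONS)
Q2 = defaultdict(lambda: [0.0] * N_ACTIONS)

def double_q_update(Q1, Q2, state, action, reward, next_state,
                    learning_rate=0.5, discount_factor=1.0, rng=random):
    # Flip a coin to decide which table is updated on this step.
    if rng.random() < 0.5:
        select, evaluate = Q1, Q2
    else:
        select, evaluate = Q2, Q1
    # The updated table picks the greedy action (maximization) ...
    values = select[next_state]
    best_action = values.index(max(values))
    # ... and the other table supplies its value (estimation).
    td_target = reward + discount_factor * evaluate[next_state][best_action]
    select[state][action] += learning_rate * (td_target - select[state][action])
    return td_target

Q1["s1"] = [1.0, 2.0]
Q2["s1"] = [5.0, 0.5]
double_q_update(Q1, Q2, "s0", 0, reward=0.0, next_state="s1")
```

Because the action is selected with one table and valued with the other, an overestimate in one table is not automatically carried into the target.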
N-Step methods unify MC and TD approaches. They make updates based on n steps instead of a single step (TD(0)) or a full episode (MC).
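The n-step idea can be illustrated with a small helper that computes the n-step return: the discounted sum of n observed rewards plus the discounted value estimate of the state reached after n steps. The function name `n_step_return` and the example numbers are assumptions.

```python
def n_step_return(rewards, bootstrap_value, discount_factor):
    """Discounted sum of the given rewards plus the discounted bootstrap value.

    rewards: the n rewards observed after the state being updated.
    bootstrap_value: value estimate of the state reached after n steps
                     (0.0 if the episode has ended).
    """
    g = bootstrap_value
    # Work backwards so each step applies one more factor of the discount.
    for reward in reversed(rewards):
        g = reward + discount_factor * g
    return g

one_step = n_step_return([1.0], bootstrap_value=8.0, discount_factor=0.5)
three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=8.0, discount_factor=0.5)
full_episode = n_step_return([1.0, 1.0, 1.0], bootstrap_value=0.0, discount_factor=0.5)
print(one_step, three_step, full_episode)  # 5.0 2.75 1.75
```

With one reward the result is the single-step TD target, and with all rewards of a finished episode and a zero bootstrap value it is the Monte Carlo return.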

Required:

Reinforcement Learning: An Introduction - Chapter 6: Temporal-Difference Learning
David Silver's RL Course Lecture 4 - Model-Free Prediction (video, slides)
David Silver's RL Course Lecture 5 - Model-Free Control (video, slides)

Optional:

[Windy Gridworld Playground](Windy Gridworld Playground.ipynb)
Implement SARSA
- Exercise
- [Solution](SARSA Solution.ipynb)
[Cliff Environment Playground](Cliff Environment Playground.ipynb)
Implement Q-Learning in Python
- Exercise
- [Solution](Q-Learning Solution.ipynb)
