class: middle, center, title-slide
Lecture 9: Reinforcement Learning
Prof. Gilles Louppe
[email protected]
How to make decisions under uncertainty, while learning about the environment?
.grid[ .kol-1-2[
- Reinforcement learning (RL)
- Passive RL
  - Model-based estimation
  - Model-free estimation
    - Direct utility estimation
    - Temporal-difference learning
- Active RL
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.footnote[Image credits: CS188, UC Berkeley.]
???
Offline solution = Planning
class: middle
A short recap.
A Markov decision process (MDP) is a tuple $(\mathcal{S}, \mathcal{A}, P, R)$ such that:
- $\mathcal{S}$ is a set of states $s$;
- $\mathcal{A}$ is a set of actions $a$;
- $P$ is a (stationary) transition model such that $P(s'|s,a)$ denotes the probability of reaching state $s'$ if action $a$ is done in state $s$;
- $R$ is a reward function that maps immediate (finite) reward values $R(s)$ obtained in states $s$.
- ($0 < \gamma \leq 1$ is the discount factor.)
class: middle
.grid[
.kol-1-5.center[
]
class: middle
- Although MDPs generalize to continuous state-action spaces, we assume in this lecture that both $\mathcal{S}$ and $\mathcal{A}$ are discrete and finite.
- The formalism we use to define MDPs is not unique. A quite well-established and equivalent variant is to define the reward function with respect to a transition $(s,a,s')$, i.e. $R(s,a,s')$. This results in new (but equivalent) formulations of the algorithms covered in Lecture 8.
The utility of a state is the immediate reward for that state, plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:
$$V(s) = R(s) + \gamma \max_a \sum_{s'} P(s'|s,a) V(s').$$
.footnote[Image credits: CS188, UC Berkeley.]
The value iteration algorithm provides a fixed-point iteration procedure for computing the state utilities $V(s)$:
- Let $V_i(s)$ be the estimated utility value for $s$ at the $i$-th iteration step.
- The Bellman update consists in updating simultaneously all the estimates to make them locally consistent with the Bellman equation:
$$V_{i+1}(s) = R(s) + \gamma \max_a \sum_{s'} P(s'|s,a) V_i(s').$$
- Repeat until convergence.
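As a concrete illustration, here is a minimal value iteration sketch in Python. The MDP interface is an assumption made for this example (`P[s][a]` as a list of `(s', p)` pairs, `R[s]` as the immediate reward, and a tolerance `tol`), not code from the course.

```python
# Minimal value iteration sketch (hypothetical MDP interface: P[s][a] is a list
# of (s_next, prob) pairs, R[s] is the immediate reward, gamma is the discount).
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}                      # initial estimates V_0(s) = 0
    while True:
        V_new = {}
        for s in states:
            # Bellman update: V_{i+1}(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V_i(s')
            V_new[s] = R[s] + gamma * max(
                sum(p * V[s_next] for s_next, p in P[s][a]) for a in actions
            )
        # Repeat until convergence of the estimates.
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```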
The policy iteration algorithm directly computes the policy (instead of state values). It alternates the following two steps:
- Policy evaluation: given $\pi_i$, calculate $V_i = V^{\pi_i}$, i.e. the utility of each state if $\pi_i$ is executed:
$$V_i(s) = R(s) + \gamma \sum_{s'} P(s'|s,\pi_i(s)) V_i(s').$$
- Policy improvement: calculate a new policy $\pi_{i+1}$ using one-step look-ahead based on $V_i$:
$$\pi_{i+1}(s) = \arg\max_a \sum_{s'} P(s'|s,a)V_i(s').$$
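A minimal policy iteration sketch under the same hypothetical MDP interface as above; policy evaluation is approximated here by a fixed number of fixed-policy Bellman sweeps, which is one common simplification.

```python
# Minimal policy iteration sketch (same hypothetical MDP interface as above;
# actions is a list, so actions[0] is an arbitrary initial choice).
def policy_iteration(states, actions, P, R, gamma=0.9, n_eval_sweeps=50):
    pi = {s: actions[0] for s in states}              # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the fixed-policy Bellman equation a few times.
        for _ in range(n_eval_sweeps):
            V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s][pi[s]])
                 for s in states}
        # Policy improvement: one-step look-ahead with respect to V.
        pi_new = {s: max(actions,
                         key=lambda a: sum(p * V[s2] for s2, p in P[s][a]))
                  for s in states}
        if pi_new == pi:                              # stop when the policy is stable
            return pi, V
        pi = pi_new
```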
class: middle
class: middle, black-slide
.center[
.footnote[Video credits: Megan Hayes, @YAWScience, 2020.]
class: middle, black-slide
.center[
.footnote[Video credits: Megan Hayes, @YAWScience, 2020.]
class: middle
- This wasn't planning, it was reinforcement learning!
- There was an MDP, but the chicken couldn't solve it with just computation.
- The chicken needed to actually act to figure it out.
- Exploration: you have to try unknown actions to get information.
- Exploitation: eventually, you have to use what you know.
- Regret: even if you learn intelligently, you make mistakes.
- Sampling: because of chance, you have to try things repeatedly.
- Difficult: learning can be much harder than solving a known MDP.
???
There was a chicken, in some unknown MDP. The chicken wanted to maximise its reward.
We still assume a Markov decision process $(\mathcal{S}, \mathcal{A}, P, R)$ such that:
- $\mathcal{S}$ is a set of states $s$;
- $\mathcal{A}$ is a set of actions $a$;
- $P$ is a (stationary) transition model such that $P(s'|s,a)$ denotes the probability of reaching state $s'$ if action $a$ is done in state $s$;
- $R$ is a reward function that maps immediate (finite) reward values $R(s)$ obtained in states $s$.

Our goal is to find the optimal policy $\pi^*$.
class: middle
The transition model $P(s'|s,a)$ and the reward function $R(s)$ are unknown:
- We do not know which states are good nor what the actions do!
- We must observe or interact with the environment in order to jointly learn these dynamics and act upon them.
.grid[
.kol-1-5.center[
$$s'$$ $$r' = \underbrace{R(s')}_{???}$$ ] .kol-3-5.center[$$s$$ .width-90[]
$$s' \sim \underbrace{P(s'|s,a)}_{???}$$ ] .kol-1-5[
$$a$$ ] ]
???
Imagine playing a new game whose rules you don’t know; after a hundred or so moves, your opponent announces, “You lose.” This is reinforcement learning in a nutshell.
- The agent's policy $\pi$ is fixed.
- Its goal is to learn the utilities $V^\pi(s)$.
- The learner has no choice about what actions to take. It just executes the policy and learns from experience.
.footnote[Image credits: CS188, UC Berkeley.]
???
This is not offline planning. You actually take actions in the world! (Since
class: middle
The agent executes a set of trials (or episodes) in the environment using policy $\pi$:
- Trial 1: $(B, -1, \text{east}, C), (C, -1, \text{east}, D), (D, +10, \text{exit}, \perp)$
- Trial 2: $(B, -1, \text{east}, C), (C, -1, \text{east}, D), (D, +10, \text{exit}, \perp)$
- Trial 3: $(E, -1, \text{north}, C), (C, -1, \text{east}, D), (D, +10, \text{exit}, \perp)$
- Trial 4: $(E, -1, \text{north}, C), (C, -1, \text{east}, A), (A, -10, \text{exit}, \perp)$
A model-based agent estimates approximate transition and reward models $\hat{P}$ and $\hat{R}$ from its experience, and then evaluates the policy in the learned MDP (see the sketch below):
- Step 1: Learn an empirical MDP.
  - Estimate $\hat{P}(s'|s,a)$ from empirical samples $(s,a,s')$ or with supervised learning.
  - Discover $\hat{R}(s)$ for each $s$.
- Step 2: Evaluate $\pi$ using $\hat{P}$ and $\hat{R}$, e.g. as
$$V(s) = \hat{R}(s) + \gamma \sum_{s'} \hat{P}(s'|s,\pi(s)) V(s').$$
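A minimal sketch of the estimation step, using the four trials listed earlier as raw data. The tuple format `(state, reward, action, next_state)`, the `None` terminal marker, and the dictionary-based counters are assumptions made for this illustration.

```python
from collections import defaultdict

# Trials as listed above: each step is (state, reward, action, next_state),
# with None standing in for the terminal marker.
trials = [
    [("B", -1, "east", "C"), ("C", -1, "east", "D"), ("D", +10, "exit", None)],
    [("B", -1, "east", "C"), ("C", -1, "east", "D"), ("D", +10, "exit", None)],
    [("E", -1, "north", "C"), ("C", -1, "east", "D"), ("D", +10, "exit", None)],
    [("E", -1, "north", "C"), ("C", -1, "east", "A"), ("A", -10, "exit", None)],
]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of observed transitions
R_hat = {}                                       # empirical reward model R̂(s)
for trial in trials:
    for s, r, a, s_next in trial:
        R_hat[s] = r
        if s_next is not None:
            counts[(s, a)][s_next] += 1

# Normalize counts into an empirical transition model P̂(s'|s,a).
P_hat = {
    (s, a): {s_next: n / sum(nexts.values()) for s_next, n in nexts.items()}
    for (s, a), nexts in counts.items()
}
print(P_hat[("C", "east")])   # {'D': 0.75, 'A': 0.25}
```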
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[
.kol-1-4.smaller-x[
Policy
.width-100[]
]
.kol-3-4[
.smaller-x[
Trajectories:
] ] ]
.grid.smaller-x[
.kol-1-2[
Learned transition model $\hat{P}(s'|s,a)$:
Can we learn $V^\pi$ directly, without first estimating $\hat{P}$ and $\hat{R}$?
(a.k.a. Monte Carlo evaluation)
- The utility $V^\pi(s)$ of state $s$ is the expected total reward from the state onward (called the expected reward-to-go),
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^\infty \gamma^t R(s_t) \right]\Biggr\rvert_{s_0=s}.$$
- Each trial provides a sample of this quantity for each state visited.
- Therefore, at the end of each sequence, one can update a sample average $\hat{V}^\pi(s)$ by:
  - computing the observed reward-to-go for each state;
  - updating the estimated utility for that state, by keeping a running average.
- In the limit of infinitely many trials, the sample average will converge to the true expectation (see the sketch below).
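A minimal sketch of this running-average procedure, using the trial format introduced earlier (an assumption of this example); $\gamma$ defaults to 1 to match the worked example.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Monte Carlo evaluation: average the observed reward-to-go per visited state."""
    totals = defaultdict(float)   # sum of observed rewards-to-go per state
    visits = defaultdict(int)     # number of samples per state
    for trial in trials:
        G = 0.0
        # Walk the trial backwards so G accumulates the discounted reward-to-go.
        for s, r, a, s_next in reversed(trial):
            G = r + gamma * G
            totals[s] += G
            visits[s] += 1
    return {s: totals[s] / visits[s] for s in totals}   # sample averages V̂^π(s)

# With gamma = 1, the four trials above give e.g. V̂(B) = 8 and V̂(E) = -2.
```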
class: middle
.grid[
.kol-1-4.smaller-x[
Policy
.width-100[]
]
.kol-3-4[
.smaller-x[
Trajectories:
.grid[
.kol-1-4.smaller-x[
Output values
.width-100[]
]
.kol-3-4.center.italic[
If both $B$ and $E$ go to $C$ under this policy,
how can their values be different?]
]
class: middle
Unfortunately, direct utility estimation misses the fact that the state values are not independent: the utility of each state equals its own reward plus the expected utility of its successor states. By ignoring these connections between states, it converges slowly.
Temporal-difference (TD) learning consists in updating $V^\pi(s)$ each time the agent experiences a transition from $s$ to $s'$.

When a transition from $s$ to $s'$ occurs, the estimate is moved towards the observed target $r + \gamma V^\pi(s')$, where $r = R(s)$:
$$V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(r + \gamma V^\pi(s') - V^\pi(s)\right),$$
where $\alpha$ is the learning rate (see the sketch below).
???
Instead of waiting for a complete trajectory to update $V^\pi(s)$, TD learning adjusts the estimate after every single transition.
...
- If $r + \gamma V^\pi(s') > V^\pi(s)$, then the prediction $V^\pi(s)$ underestimates the value, hence the increment.
- If $r + \gamma V^\pi(s') < V^\pi(s)$, then the prediction $V^\pi(s)$ overestimates the value, hence the decrement.
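A minimal TD(0) sketch of this update rule; the dictionary-based value table and the state names are assumptions made to match the worked example on the next slides.

```python
def td_update(V, s, r, s_next, alpha=0.5, gamma=1.0):
    """One temporal-difference update after observing reward r in s and moving to s'."""
    target = r + gamma * V.get(s_next, 0.0)                    # one-sample estimate of V^pi(s)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))    # move the estimate toward the target
    return V

# Reproducing the worked example on the next slides (alpha = 0.5, gamma = 1,
# all values initially 0 except V(D), assumed to be 8 already):
V = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 8.0, "E": 0.0}
td_update(V, "B", -1, "C")   # V(B) becomes -0.5
td_update(V, "C", -1, "D")   # V(C) becomes  3.5
```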
class: middle
Alternatively, the TD-update can be viewed as a single gradient descent step on the squared error between the target $r + \gamma V^\pi(s')$ and the prediction $V^\pi(s)$.
class: middle
The TD-update can equivalently be expressed as the exponential moving average
$$V^\pi(s) \leftarrow (1-\alpha)\, V^\pi(s) + \alpha\left(r + \gamma V^\pi(s')\right).$$
Intuitively,
- this makes recent samples more important;
- this forgets about the past (distant past values were wrong anyway).
class: middle
.grid[
.kol-1-4[]
.kol-1-2.center[.width-45[] .width-45[
]
Transition: $(B, -1, \text{east}, C)$.
TD-update:
$\begin{aligned} V^\pi(B) &\leftarrow V^\pi(B) + \alpha(R(B) + \gamma V^\pi(C) - V^\pi(B)) \\ &\leftarrow 0 + 0.5 (-1 + 0 - 0) \\ &\leftarrow -0.5 \end{aligned}$
class: middle
.grid[
.kol-1-4[]
.kol-1-2.center[.width-45[] .width-45[
]
Transition: $(C, -1, \text{east}, D)$.
TD-update:
$\begin{aligned} V^\pi(C) &\leftarrow V^\pi(C) + \alpha(R(C) + \gamma V^\pi(D) - V^\pi(C)) \\ &\leftarrow 0 + 0.5 (-1 + 8 - 0) \\ &\leftarrow 3.5 \end{aligned}$
???
Note how the large reward eventually propagates back to the states leading to it.
Doing the first update again would result in a better value for $V^\pi(B)$.
class: middle
- Notice that the TD-update involves only the observed successor $s'$, whereas the actual Bellman equations for a fixed policy involve all possible next states. Nevertheless, the average value of $V^\pi(s)$ will converge to the correct value.
- If we change $\alpha$ from a fixed parameter to a function that decreases as the number of times a state has been visited increases, then $V^\pi(s)$ will itself converge to the correct value.
- The agent's policy is not fixed anymore.
- Its goal is to learn the optimal policy $\pi^*$ or the state values $V(s)$.
- The learner makes choices!
- Fundamental trade-off: exploration vs. exploitation.
.footnote[Image credits: CS188, UC Berkeley.]
The passive model-based agent can be made active by instead finding the optimal policy of the learned empirical MDP (e.g., with value or policy iteration).

For example, having obtained a utility function $V$ that is optimal for the learned model, the agent can extract an optimal action by one-step look-ahead:
$$\pi(s) = \arg\max_a \sum_{s'} \hat{P}(s'|s,a) V(s').$$
class: middle, center
The agent does not learn the true utilities or the true optimal policy!
class: middle
The resulting policy is greedy and suboptimal:
- The learned transition and reward models $\hat{P}$ and $\hat{R}$ are not the same as the true environment since they are based on the samples obtained by the agent's policy, which biases the learning.
- Therefore, what is optimal in the learned model can be suboptimal in the true environment.
Actions do more than provide rewards according to the current learned model. They also contribute to learning the true environment.
This is the exploitation-exploration trade-off:
- Exploitation: follow actions that maximize the rewards, under the current learned model;
- Exploration: follow actions to explore and learn about the true environment.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Simplest approach for forcing exploration: random actions ($\epsilon$-greedy), as sketched below.
- With a (small) probability $\epsilon$, act randomly.
- With a (large) probability $(1-\epsilon)$, follow the current policy.
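A minimal sketch of $\epsilon$-greedy action selection; the `policy` dictionary and the list of `actions` are assumed representations for this illustration.

```python
import random

def epsilon_greedy(policy, s, actions, epsilon=0.1):
    """With (small) probability epsilon act randomly, otherwise follow the current policy."""
    if random.random() < epsilon:
        return random.choice(actions)   # exploration: try a random action
    return policy[s]                    # exploitation: follow the current policy
```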
class: middle
Better idea: explore areas whose badness is not (yet) established, then stop exploring.
Formally, let $N(s,a)$ denote the number of times action $a$ has been tried in state $s$, and let $f(u, n)$ be an exploration function that trades off greed (preference for high utility estimates $u$) against curiosity (preference for low counts $n$).

For Value Iteration, the update equation becomes
$$V_{i+1}(s) = R(s) + \gamma \max_a f\left( \sum_{s'} P(s'|s,a) V_i(s'),\, N(s,a) \right).$$

The function $f(u, n)$ should be increasing in $u$ and decreasing in $n$. A simple choice is
$$f(u, n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise,} \end{cases}$$
where $R^+$ is an optimistic estimate of the best possible reward and $N_e$ is a fixed visit threshold (a sketch of this choice follows).
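A minimal sketch of this exploration function and of the modified backup; the MDP interface (`P`, `R`), the visit counts `N`, and the constants `R_plus` and `N_e` are illustrative assumptions, as in the earlier sketches.

```python
def f(u, n, R_plus=10.0, N_e=5):
    """Optimistic exploration function: pretend the value is R_plus until the
    state-action pair has been tried at least N_e times (illustrative constants)."""
    return R_plus if n < N_e else u

def optimistic_backup(s, V, N, P, R, actions, gamma=0.9):
    """Value-iteration backup with the exploration function wrapped around each action value."""
    return R[s] + gamma * max(
        f(sum(p * V[s2] for s2, p in P[s][a]), N[(s, a)]) for a in actions
    )
```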
???
This is similar to MCTS! (Lecture 3)
Although temporal-difference learning provides a way to estimate $V^\pi$, acting greedily with respect to state values still requires a one-step look-ahead, and hence a transition model. Learning state-action values $Q(s,a)$ instead removes this need for a model when selecting actions.
.center.width-50[]
.footnote[Image credits: CS188, UC Berkeley.]
.grid[ .kol-1-2[
- The state-value $V(s)$ of the state $s$ is the expected utility starting in $s$ and acting optimally.
- The state-action-value $Q(s,a)$ of the q-state $(s,a)$ is the expected utility starting out having taken action $a$ from $s$ and thereafter acting optimally. Therefore $V(s) = \max_a Q(s,a)$.
] .kol-1-2.width-100[] ]
class: middle
The optimal policy can be recovered directly from the Q-values, without a transition model:
$$\pi^*(s) = \arg\max_a Q(s,a).$$
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Since $V(s) = \max_a Q(s,a)$, the optimal Q-values satisfy the Bellman equation
$$Q(s,a) = R(s) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a').$$
As for value iteration, the last equation can be used as an update equation for a fixed-point iteration procedure that calculates the Q-values,
$$Q_{i+1}(s,a) = R(s) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q_i(s',a'),$$
provided the transition and reward models are known.
The state-action-values $Q(s,a)$ can however also be estimated directly from experience, without ever learning $\hat{P}$ or $\hat{R}$.

Q-Learning consists in updating $Q(s,a)$ each time the agent experiences a transition from $s$ to $s'$ after taking action $a$ and receiving reward $r = R(s)$.

The update equation for TD Q-Learning is
$$Q(s,a) \leftarrow Q(s,a) + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right)$$
(see the sketch below).
.alert[Since the agent does not need to learn a model of the environment to act, Q-Learning is said to be model-free.]
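A minimal sketch of this tabular update; the dictionary-based Q-table keyed by `(state, action)` pairs and the reward convention follow the earlier sketches and are assumptions of this example.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """One TD Q-Learning update after observing the transition (s, a, r, s')."""
    # Off-policy target: the best Q-value achievable from the next state,
    # regardless of the action the agent will actually take there.
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
    return Q
```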
class: middle
class: middle
Q-Learning converges to an optimal policy, even when acting suboptimally.
- This is called off-policy learning.
- Technical caveats:
  - You have to explore enough.
  - The learning rate must eventually become small enough.
  - ... but it shouldn't decrease too quickly.
.footnote[Image credits: CS188, UC Berkeley.]
.grid[ .kol-2-3[
- Basic Q-Learning keeps a table for all Q-values $Q(s,a)$.
- In realistic situations, we cannot possibly learn about every single state!
  - Too many states to visit them all in training.
  - Too many states to hold the Q-table in memory.
- We want to generalize: learn about a small number of states from experience and transfer that knowledge to new, similar states.
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid.center[ .kol-1-8[] .kol-1-4[(a)
If we discover by experience that (a) is bad, then in naive Q-Learning, we know nothing about (b) nor (c)!
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
.grid[ .kol-3-4[
Solution: describe a state $s$ using a vector of features.
- Features are functions $f_k$ from states to real numbers that capture important properties of the state.
- Example features:
  - Distance to closest ghost
  - Distance to closest dot
  - Number of ghosts
  - ...
- Can similarly describe a q-state $(s, a)$ with features $f_k(s,a)$.
] .kol-1-4.width-100[] ]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
Using a feature-based representation, the Q-table can now be replaced with a function approximator, such as a linear model
$$Q(s,a) = \sum_k w_k f_k(s,a),$$
where the $w_k$ are weights to be learned.

Upon the transition $(s, a, r, s')$, each weight is updated as
$$w_k \leftarrow w_k + \alpha\left(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right) f_k(s,a).$$
.footnote[Image credits: CS188, UC Berkeley.]
???
Remember that the TD-update can be viewed as an online gradient descent update, except that now we do not directly modify the value estimates but the weights $w_k$ of the function approximator.
class: middle
In linear regression, imagine we had only one data point $x$ with features $f_k(x)$ and target value $y$. The squared error and its gradient are
$$\text{error}(w) = \frac{1}{2}\left(y - \sum_k w_k f_k(x)\right)^2, \qquad \frac{\partial\, \text{error}(w)}{\partial w_k} = -\left(y - \sum_k w_k f_k(x)\right) f_k(x),$$
hence the Q-update
$$w_k \leftarrow w_k + \alpha\left(\underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{target } y} - \underbrace{Q(s,a)}_{\text{prediction}}\right) f_k(s,a).$$
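Putting the pieces together, a minimal sketch of the approximate Q-Learning weight update; the dictionary-based feature vectors and the function names (`q_value`, `approx_q_update`) are illustrative assumptions.

```python
def q_value(w, features):
    """Linear approximation Q(s,a) = sum_k w_k * f_k(s,a), with features given as a dict."""
    return sum(w.get(k, 0.0) * v for k, v in features.items())

def approx_q_update(w, feats_sa, r, best_next_q, alpha=0.01, gamma=0.99):
    """Gradient-style weight update after a transition (s, a, r, s')."""
    difference = (r + gamma * best_next_q) - q_value(w, feats_sa)   # TD error (target - prediction)
    for k, f_k in feats_sa.items():
        w[k] = w.get(k, 0.0) + alpha * difference * f_k             # w_k += alpha * difference * f_k(s,a)
    return w
```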
class: middle
Similarly, the Q-table can be replaced with a neural network as function approximator, resulting in the DQN algorithm.
class: middle, center
(demo)
class: middle
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/Tnu4O_xEmVk?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>MarIQ ]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/l5o429V1bbU?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Playing Atari Games (Pinball) ]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/W4joe3zzglU?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Robotic manipulation ]
class: middle, black-slide
.center[
<iframe width="640" height="480" src="https://www.youtube.com/embed/fBiataDpGIo?&loop=1&start=0" frameborder="0" volume="0" allowfullscreen></iframe>Drone racing ]
.footnote[Image credits: CS188, UC Berkeley.]
class: middle
By the end of this course, you will have built autonomous agents that efficiently make decisions in fully informed, partially observable and adversarial settings. Your agents will draw inferences in uncertain and unknown environments and optimize actions for arbitrary reward structures.
class: end-slide, center count: false
The end.