SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that enables an agent to learn the optimal policy in a Markov Decision Process (MDP). Being on-policy, SARSA evaluates and improves the policy that the agent is currently following during learning, rather than a separate, optimal policy. This means the algorithm updates its Q-values (action-value function) based on the actions taken by the current policy, including exploratory actions.
In simpler terms, SARSA not only learns from the environment but also from the actual behavior of the agent, ensuring that the learned values reflect the policy being followed. This approach is particularly useful in environments where adherence to a cautious or safe policy is crucial.
On-Policy Learning:
- SARSA learns the value of the current policy being followed, unlike off-policy methods like Q-Learning that evaluate the optimal policy independently of the behavior policy.
Update Rule:
- The Q-value update in SARSA is governed by:

  $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$

  where:
  - $s, a$: current state and action.
  - $r$: reward received after taking action $a$.
  - $s', a'$: next state and the action chosen there by the current policy.
  - $\alpha$: learning rate.
  - $\gamma$: discount factor.
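As a rough sketch in Python (the dictionary-based Q-table and the function name are illustrative choices, not taken from any particular library), a single SARSA update looks like this:

```python
# Minimal sketch of one tabular SARSA update.
# `Q` is assumed to be a dict mapping (state, action) pairs to floats.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[(s_next, a_next)]  # bootstraps from the action the policy actually chose
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q
```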
Exploration vs. Exploitation:
- SARSA typically uses an epsilon-greedy policy for action selection, allowing the agent to balance exploring new actions and exploiting the best-known actions.
Advantages:
- Learns directly from the current policy, making it suitable for environments where cautious learning is important.
- Works well in stochastic environments or when safety is a priority.

Limitations:
- Convergence is slower compared to Q-Learning, especially in deterministic environments.
- Performance is sensitive to the exploration strategy (e.g., epsilon-greedy).
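The epsilon-greedy selection referred to above can be sketched as follows; the dictionary-based Q-table and argument names are assumptions made for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit
```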
SARSA iteratively improves the Q-values and the policy by interacting with the environment. Below are the steps:
Initialize:
- Set all Q-values $Q(s, a)$ to 0 (or small random values).
- Define the exploration policy (e.g., epsilon-greedy).
For Each Episode:
- Start in an initial state $s$.
- Choose an action $a$ using the current policy (e.g., epsilon-greedy).
- Repeat until the episode ends:
  - Take action $a$, observe reward $r$ and next state $s'$.
  - Choose the next action $a'$ in $s'$ using the current policy.
  - Update $Q(s, a)$ using $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$.
  - Transition to $s'$ and $a'$.
Update Policy:
- Derive the optimal policy from the learned Q-values.
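Putting these steps together, a compact sketch of the full training loop might look like the following; the `env.reset()`/`env.step()` interface, the episode count, and the hyperparameter defaults are assumptions for illustration rather than a reference implementation:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch; `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0 for all state-action pairs

    def policy(state):
        # epsilon-greedy action selection over the current Q-values
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()            # start in an initial state
        a = policy(s)              # choose the first action with the current policy
        done = False
        while not done:
            s_next, r, done = env.step(a)     # take action, observe reward and next state
            a_next = policy(s_next)           # choose the next action with the same policy
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
            s, a = s_next, a_next             # transition to s' and a'

    # derive a greedy policy from the learned Q-values
    return {s: max(actions, key=lambda a: Q[(s, a)]) for (s, _) in Q}
```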
SARSA is particularly useful when:
- The environment has stochastic transitions or rewards.
- Adhering to a safe or cautious policy is critical (e.g., avoiding dangerous states).
- You want to learn the value of the current policy rather than the optimal policy directly.
Feature | SARSA | Q-Learning |
---|---|---|
Type | On-Policy | Off-Policy |
Update Rule | $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right]$ | $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$ |
Exploration Strategy | Directly affects learning. | Independent of the learned policy. |
Convergence Speed | Slower | Faster |
Behavior | More cautious, follows the current policy. | Can explore aggressively to find optimal policies. |
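The update-rule row is the heart of the comparison: SARSA bootstraps from the action the policy actually takes next, while Q-Learning bootstraps from the greedy action. A small sketch makes the contrast explicit (tabular Q as a dict; all names are illustrative assumptions):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # On-policy: use the value of the action a' the current policy actually chose.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.9):
    # Off-policy: use the value of the greedy action, regardless of what the policy does next.
    return r + gamma * max(Q[(s_next, a)] for a in actions)
```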
A 2x2 grid world:
- States: $S_1, S_2, S_3, S_4$.
- Actions: Up, Down, Left, Right.
- Rewards:
  - $S_4$: goal (terminal) state, reward $r_{\text{goal}} > 0$.
  - Other states have a step cost $r_{\text{step}} < 0$.
- Start state: $S_1$ (top-left corner).
- Policy: The agent follows an epsilon-greedy policy.

Grid Layout:

$S_1$ (start) | $S_2$ |
---|---|
$S_3$ | $S_4$ (goal) |
Initially, $Q(s, a) = 0$ for all state-action pairs.
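For readers who want to run the example, here is a minimal sketch of this grid as a Python environment. The class name `GridWorld2x2` and the concrete reward values ($r_{\text{goal}} = +10$, $r_{\text{step}} = -1$) are assumptions chosen for illustration; the `reset()`/`step()` interface matches the training-loop sketch shown earlier.

```python
class GridWorld2x2:
    """Sketch of the 2x2 grid: S1 S2 / S3 S4, start at S1, terminal goal at S4.
    Reward values are assumptions for illustration (goal = +10, step cost = -1)."""

    ACTIONS = ["Up", "Down", "Left", "Right"]
    POS = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
    MOVE = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

    def reset(self):
        self.state = "S1"            # start in the top-left corner
        return self.state

    def step(self, action):
        row, col = self.POS[self.state]
        dr, dc = self.MOVE[action]
        # clamp the move so the agent stays inside the 2x2 grid
        new_pos = (min(max(row + dr, 0), 1), min(max(col + dc, 0), 1))
        self.state = next(s for s, p in self.POS.items() if p == new_pos)
        if self.state == "S4":
            return self.state, 10.0, True    # assumed goal reward, episode ends
        return self.state, -1.0, False       # assumed step cost
```

With this in place, calling `sarsa(GridWorld2x2(), GridWorld2x2.ACTIONS)` from the earlier sketch would learn Q-values for the walkthrough below.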
Episode Start:
- Start at $S_1$.
- Use epsilon-greedy to choose an action; suppose it selects Right.
First Transition:
- Take action Right from $S_1$.
- Move to $S_2$ and receive the step-cost reward $r_{\text{step}}$.
- Choose the next action in $S_2$ using epsilon-greedy; suppose it selects Down.
- Update $Q(S_1, \text{Right})$:

  $Q(S_1, \text{Right}) \leftarrow Q(S_1, \text{Right}) + \alpha \left[ r_{\text{step}} + \gamma \, Q(S_2, \text{Down}) - Q(S_1, \text{Right}) \right]$

  Since every Q-value is still 0, this reduces to $Q(S_1, \text{Right}) = \alpha \, r_{\text{step}}$, a small negative value reflecting the step cost.
Second Transition:
- Take action Down from $S_2$.
- Move to $S_4$ and receive the goal reward $r_{\text{goal}}$.
- $S_4$ is terminal, so $Q(S_4, a') = 0$.
- Update $Q(S_2, \text{Down})$:

  $Q(S_2, \text{Down}) \leftarrow Q(S_2, \text{Down}) + \alpha \left[ r_{\text{goal}} + \gamma \cdot 0 - Q(S_2, \text{Down}) \right]$

  With $Q(S_2, \text{Down})$ still 0, this reduces to $Q(S_2, \text{Down}) = \alpha \, r_{\text{goal}}$, a positive value proportional to the goal reward.
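To put concrete numbers on these two updates, the snippet below replays them with assumed values ($\alpha = 0.1$, $\gamma = 0.9$, $r_{\text{step}} = -1$, $r_{\text{goal}} = +10$); none of these figures are dictated by the algorithm itself.

```python
alpha, gamma = 0.1, 0.9                                  # assumed learning rate and discount factor
Q = {("S1", "Right"): 0.0, ("S2", "Down"): 0.0}          # all Q-values start at 0

# First transition: S1 --Right--> S2, assumed step cost r = -1, next action a' = Down
Q[("S1", "Right")] += alpha * (-1 + gamma * Q[("S2", "Down")] - Q[("S1", "Right")])

# Second transition: S2 --Down--> S4 (terminal), assumed goal reward r = +10, so Q(S4, .) = 0
Q[("S2", "Down")] += alpha * (10 + gamma * 0.0 - Q[("S2", "Down")])

print(Q)   # Q(S1, Right) ≈ -0.1, Q(S2, Down) ≈ 1.0
```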
State | Action | Q-Value |
---|---|---|
$S_1$ | Right | $\alpha \, r_{\text{step}}$ (slightly negative) |
$S_2$ | Down | $\alpha \, r_{\text{goal}}$ (positive) |
All other states/actions | - | 0 |
- SARSA updates $Q(s, a)$ based on the current policy's actions.
- Iterative updates refine the Q-values and improve the policy over time.