SARSA (State-Action-Reward-State-Action) is an on-policy reinforcement learning algorithm that enables an agent to learn the optimal policy in a Markov Decision Process (MDP). Being on-policy, SARSA evaluates and improves the policy that the agent is currently following during learning, rather than a separate, optimal policy. This means the algorithm updates its Q-values (action-value function) based on the actions taken by the current policy, including exploratory actions.
In simpler terms, SARSA not only learns from the environment but also from the actual behavior of the agent, ensuring that the learned values reflect the policy being followed. This approach is particularly useful in environments where adherence to a cautious or safe policy is crucial.
On-Policy Learning:
- SARSA learns the value of the current policy being followed, unlike off-policy methods like Q-Learning that evaluate the optimal policy independently of the behavior policy.
Update Rule:
- The Q-value update in SARSA is governed by:

  $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$

  where:
  - $s, a$: current state and action.
  - $r$: reward received after taking action $a$.
  - $s', a'$: next state and the action chosen there by the current policy.
  - $\alpha$: learning rate.
  - $\gamma$: discount factor.
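As a rough sketch in Python (the dictionary-based Q-table and the function name are illustrative choices, not taken from any particular library), a single SARSA update looks like this:

```python
# Minimal sketch of one tabular SARSA update.
# `Q` is assumed to be a dict mapping (state, action) pairs to floats.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * Q[(s_next, a_next)]  # bootstraps from the action the policy actually chose
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q
```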
Exploration vs. Exploitation:
- SARSA typically uses an epsilon-greedy policy for action selection, allowing the agent to balance exploring new actions and exploiting the best-known actions.
Advantages:
- Learns directly from the current policy, making it suitable for environments where cautious learning is important.
- Works well in stochastic environments or when safety is a priority.

Limitations:
- Convergence is slower compared to Q-Learning, especially in deterministic environments.
- Performance is sensitive to the exploration strategy (e.g., epsilon-greedy).
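The epsilon-greedy selection referred to above can be sketched as follows; the dictionary-based Q-table and argument names are assumptions made for illustration:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore a random action; otherwise exploit the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit
```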
SARSA iteratively improves the Q-values and the policy by interacting with the environment. Below are the steps:
Initialize:
- Set all Q-values $Q(s, a)$ to 0 (or small random values).
- Define the exploration policy (e.g., epsilon-greedy).
For Each Episode:
- Start in an initial state $s$.
- Choose an action $a$ using the current policy (e.g., epsilon-greedy).
- Repeat until the episode ends:
  - Take action $a$, observe reward $r$ and next state $s'$.
  - Choose the next action $a'$ in $s'$ using the current policy.
  - Update $Q(s, a)$ using $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \, Q(s', a') - Q(s, a) \right]$.
  - Transition to $s'$ and $a'$.
Update Policy:
- Derive the optimal policy from the learned Q-values.
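Putting these steps together, a compact sketch of the full training loop might look like the following; the `env.reset()`/`env.step()` interface, the episode count, and the hyperparameter defaults are assumptions for illustration rather than a reference implementation:

```python
import random
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular SARSA sketch; `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)  # Q(s, a) initialized to 0 for all state-action pairs

    def policy(state):
        # epsilon-greedy action selection over the current Q-values
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(episodes):
        s = env.reset()            # start in an initial state
        a = policy(s)              # choose the first action with the current policy
        done = False
        while not done:
            s_next, r, done = env.step(a)     # take action, observe reward and next state
            a_next = policy(s_next)           # choose the next action with the same policy
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
            s, a = s_next, a_next             # transition to s' and a'

    # derive a greedy policy from the learned Q-values
    return {s: max(actions, key=lambda a: Q[(s, a)]) for (s, _) in Q}
```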
SARSA is particularly useful when:
- The environment has stochastic transitions or rewards.
- Adhering to a safe or cautious policy is critical (e.g., avoiding dangerous states).
- You want to learn the value of the current policy rather than the optimal policy directly.
Feature | SARSA | Q-Learning |
---|---|---|
Type | On-Policy | Off-Policy |
Update Rule | $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \, Q(s',a') - Q(s,a) \right]$ | $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$ |
Exploration Strategy | Directly affects learning. | Independent of the learned policy. |
Convergence Speed | Slower | Faster |
Behavior | More cautious, follows the current policy. | Can explore aggressively to find optimal policies. |
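The update-rule row is the heart of the comparison: SARSA bootstraps from the action the policy actually takes next, while Q-Learning bootstraps from the greedy action. A small sketch makes the contrast explicit (tabular Q as a dict; all names are illustrative assumptions):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    # On-policy: use the value of the action a' the current policy actually chose.
    return r + gamma * Q[(s_next, a_next)]

def q_learning_target(Q, r, s_next, actions, gamma=0.9):
    # Off-policy: use the value of the greedy action, regardless of what the policy does next.
    return r + gamma * max(Q[(s_next, a)] for a in actions)
```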
A 2x2 grid world:
- States: $S_1, S_2, S_3, S_4$.
- Actions: Up, Down, Left, Right.
- Rewards:
  - $S_4$: goal (terminal) state, reward $r_{\text{goal}} > 0$.
  - Other states have a step cost $r_{\text{step}} < 0$.
- Start state: $S_1$ (top-left corner).
- Policy: The agent follows an epsilon-greedy policy.

Grid Layout:

$S_1$ (start) | $S_2$ |
---|---|
$S_3$ | $S_4$ (goal) |
Initially, $Q(s, a) = 0$ for all state-action pairs.
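For readers who want to run the example, here is a minimal sketch of this grid as a Python environment. The class name `GridWorld2x2` and the concrete reward values ($r_{\text{goal}} = +10$, $r_{\text{step}} = -1$) are assumptions chosen for illustration; the `reset()`/`step()` interface matches the training-loop sketch shown earlier.

```python
class GridWorld2x2:
    """Sketch of the 2x2 grid: S1 S2 / S3 S4, start at S1, terminal goal at S4.
    Reward values are assumptions for illustration (goal = +10, step cost = -1)."""

    ACTIONS = ["Up", "Down", "Left", "Right"]
    POS = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
    MOVE = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

    def reset(self):
        self.state = "S1"            # start in the top-left corner
        return self.state

    def step(self, action):
        row, col = self.POS[self.state]
        dr, dc = self.MOVE[action]
        # clamp the move so the agent stays inside the 2x2 grid
        new_pos = (min(max(row + dr, 0), 1), min(max(col + dc, 0), 1))
        self.state = next(s for s, p in self.POS.items() if p == new_pos)
        if self.state == "S4":
            return self.state, 10.0, True    # assumed goal reward, episode ends
        return self.state, -1.0, False       # assumed step cost
```

With this in place, calling `sarsa(GridWorld2x2(), GridWorld2x2.ACTIONS)` from the earlier sketch would learn Q-values for the walkthrough below.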
Episode Start:
- Start at $S_1$.
- Use epsilon-greedy to choose an action; suppose it selects Right.
First Transition:
- Take action Right from $S_1$.
- Move to $S_2$ and receive the step-cost reward $r_{\text{step}}$.
- Choose the next action in $S_2$ using epsilon-greedy; suppose it selects Down.
- Update $Q(S_1, \text{Right})$:

  $Q(S_1, \text{Right}) \leftarrow Q(S_1, \text{Right}) + \alpha \left[ r_{\text{step}} + \gamma \, Q(S_2, \text{Down}) - Q(S_1, \text{Right}) \right]$

  Since every Q-value is still 0, this reduces to $Q(S_1, \text{Right}) = \alpha \, r_{\text{step}}$, a small negative value reflecting the step cost.
Second Transition:
- Take action Down from $S_2$.
- Move to $S_4$ and receive the goal reward $r_{\text{goal}}$.
- $S_4$ is terminal, so $Q(S_4, a') = 0$.
- Update $Q(S_2, \text{Down})$:

  $Q(S_2, \text{Down}) \leftarrow Q(S_2, \text{Down}) + \alpha \left[ r_{\text{goal}} + \gamma \cdot 0 - Q(S_2, \text{Down}) \right]$

  With $Q(S_2, \text{Down})$ still 0, this reduces to $Q(S_2, \text{Down}) = \alpha \, r_{\text{goal}}$, a positive value proportional to the goal reward.
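To put concrete numbers on these two updates, the snippet below replays them with assumed values ($\alpha = 0.1$, $\gamma = 0.9$, $r_{\text{step}} = -1$, $r_{\text{goal}} = +10$); none of these figures are dictated by the algorithm itself.

```python
alpha, gamma = 0.1, 0.9                                  # assumed learning rate and discount factor
Q = {("S1", "Right"): 0.0, ("S2", "Down"): 0.0}          # all Q-values start at 0

# First transition: S1 --Right--> S2, assumed step cost r = -1, next action a' = Down
Q[("S1", "Right")] += alpha * (-1 + gamma * Q[("S2", "Down")] - Q[("S1", "Right")])

# Second transition: S2 --Down--> S4 (terminal), assumed goal reward r = +10, so Q(S4, .) = 0
Q[("S2", "Down")] += alpha * (10 + gamma * 0.0 - Q[("S2", "Down")])

print(Q)   # Q(S1, Right) ≈ -0.1, Q(S2, Down) ≈ 1.0
```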
State | Action | Q-Value |
---|---|---|
$S_1$ | Right | $\alpha \, r_{\text{step}}$ (slightly negative) |
$S_2$ | Down | $\alpha \, r_{\text{goal}}$ (positive) |
All other states/actions | - | 0 |
- SARSA updates $Q(s, a)$ based on the current policy's actions.
- Iterative updates refine the Q-values and improve the policy over time.