Q-learning, being off-policy, handles the exploration-exploitation tradeoff well compared to on-policy reinforcement learning methods. However, it suffers from overestimation bias because it uses the maximum Q-value over the next state's actions in the TD update.
This project uses Double Q-learning to mitigate this problem. It trains an agent to solve the FrozenLake
environment from the Gymnasium library using two Q functions; at every step, one of the two is chosen at random and updated according to the following rule:
If $Q_A$ is selected for the update:

$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha \left[ r + \gamma \, Q_B\!\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a) \right]$$

If $Q_B$ is selected for the update:

$$Q_B(s, a) \leftarrow Q_B(s, a) + \alpha \left[ r + \gamma \, Q_A\!\left(s', \arg\max_{a'} Q_B(s', a')\right) - Q_B(s, a) \right]$$
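In code, this update might look like the following minimal sketch for tabular Q-tables (the function name, argument names, and terminal-state handling below are illustrative assumptions, not the repository's actual API):

```python
import numpy as np

def double_q_update(q_a, q_b, state, action, reward, next_state, terminated,
                    alpha, gamma, rng):
    """One Double Q-learning TD update on two tabular Q-tables
    (NumPy arrays of shape [n_states, n_actions])."""
    if rng.random() < 0.5:
        # Update Q_A: select the greedy next action with Q_A, evaluate it with Q_B.
        best_next = int(np.argmax(q_a[next_state]))
        target = reward + (0.0 if terminated else gamma * q_b[next_state, best_next])
        q_a[state, action] += alpha * (target - q_a[state, action])
    else:
        # Update Q_B: select the greedy next action with Q_B, evaluate it with Q_A.
        best_next = int(np.argmax(q_b[next_state]))
        target = reward + (0.0 if terminated else gamma * q_a[next_state, best_next])
        q_b[state, action] += alpha * (target - q_b[state, action])
```

Because the action is selected with one table but evaluated with the other, the positive bias introduced by the max operator in standard Q-learning is reduced.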
The `agent` class:

- `learning_rate`: determines how fast the model learns ($\alpha$).
- `gamma`: discount factor; determines how much importance future rewards should have ($\gamma$).
- `epsilon`: used in the $\epsilon$-greedy policy to tune the exploration-exploitation tradeoff. As epsilon decreases, the policy becomes increasingly greedy.
- `q` & `dq`: the two Q functions.
- `e_greedy()`: given a value of `epsilon`, chooses an action according to the following probabilities: $P = 1 - \epsilon + \frac{\epsilon}{|\mathcal{A}|}$ for the best-known action and $P = \frac{\epsilon}{|\mathcal{A}|}$ for each remaining action, where $|\mathcal{A}|$ is the total number of actions (see the sketch after this list).
- `get_best_action()`: chooses the action with the highest Q-value for a given state.
- `run_policy()`: plays the learnt policy.
- The Q-value table is plotted after training, and arrows are used to denote the favourability of each action.
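As a rough illustration of the `e_greedy()` selection rule, a standalone sketch is shown below; it is a hedged example of the stated probabilities, not the class's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def e_greedy(q_values, epsilon):
    """Sample an action for one state: the best-known action gets probability
    1 - epsilon + epsilon/|A|, every other action gets epsilon/|A|."""
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)   # epsilon/|A| for every action
    probs[int(np.argmax(q_values))] += 1.0 - epsilon  # extra mass on the best-known action
    return int(rng.choice(n_actions, p=probs))

# Example: in Double Q-learning the behaviour policy is commonly greedy with
# respect to the combined estimate, e.g. q[state] + dq[state] (assumption here).
q_state = np.array([0.1, 0.4, 0.0, 0.2])   # toy Q-values for one FrozenLake state
action = e_greedy(q_state, epsilon=0.1)
```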