-
Hello OpenSpiel team, I think there might be a bug in the tabular Q-learning algorithm. In the function, I suppose this line is used to alternate negative/positive q-values based on the current player (this also seems to be suggested by the comment above that line of code). Though, it looks like the value from …

I have noticed something else: in games like Tic-Tac-Toe, we're never going to have negative rewards in the code linked here. This is because the player taking the last action can only end up in a draw or a victory, after which the game ends. Is this going to affect the updates of the q-values somehow? I have a feeling that the logic above, where we change the sign of the next q-value, compensates for this, but I'm having a hard time verifying it.

Wouldn't it be more appropriate to update the Q-table only when a specified player is taking actions, and to find a way to add the negative reward when the game is lost? Apologies if this is exactly what is being done already. I'm still doing some reading on Q-learning in two-player zero-sum games (links below if people are interested) to understand how tabular Q-learning works in that setting.
-
Hi @giogix2,

I'm having a hard time understanding where the bug could be, so I'll follow up with an explanation. Maybe you can try to come up with an example where you think the update rule is wrong?

The first important thing to note is that this is a "two-player zero-sum" specialized variant of Q-learning (specifically, it is different from two independent Q-learners playing against each other -- that's critical).
The q-values, e.g. q(s,a), stored in the table are always from the point of view of whose turn it is at s. So, when you call GetBestActionValue(s'), it returns the value of the best action for the player to move at s', which is why you have to do the sign check in the code. You're right that in a strictly alternating game like Tic-Tac-Toe this sign will always be -1, and that is intended. But in a gridworld (a single-agent problem), it will always be +1, and in a game where the same player can move several times in a row (e.g. backgammon), the update rule is correct there too. The Q-learning update rule requires that the Q-values always be interpreted from the same perspective, so it just applies the two-player zero-sum identity q_0(s,a) = -q_1(s,a).
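To make that concrete, here is a minimal Python sketch of what an update with a single shared table could look like. This is not OpenSpiel's actual (C++) implementation; the names `td_update`, `same_player`, and `legal_actions_prime` are invented for illustration.

```python
# Minimal sketch (not OpenSpiel code): a single shared Q-table where
# q[(s, a)] is stored from the perspective of the player to move at s.
from collections import defaultdict

q = defaultdict(float)  # hypothetical table: (state, action) -> value

def td_update(s, a, r, s_prime, legal_actions_prime, same_player,
              alpha=0.1, gamma=1.0):
    """One tabular update using the two-player zero-sum identity.

    `same_player` is True when the mover at s_prime is also the mover at s
    (a gridworld, or a repeated move in a game like backgammon); otherwise
    the best value at s_prime belongs to the opponent and is negated.
    """
    if legal_actions_prime:   # non-terminal next state
        best_next = max(q[(s_prime, ap)] for ap in legal_actions_prime)
    else:                     # terminal next state
        best_next = 0.0
    sign = 1.0 if same_player else -1.0
    target = r + gamma * sign * best_next
    q[(s, a)] += alpha * (target - q[(s, a)])
```

The only two-player-specific piece is the `sign` factor: it is the zero-sum identity applied when backing up a value that belongs to the other player.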
This is also the correct interpretation because the game can only end in a win or a draw, and it is precisely why you have to use the identity above. A move a from s will only start to be recognized as a losing move when the player that follows (at s') recognizes the win from s' (say via a*), and then its q(s', a*) will become positive and converge to 1. a* will be selected in GetBestActionValue() along with its value at s' (1), and that value will get inverted at s. So q(s,a) will then always receive targets of -1, which is correct since a is a losing move.

I suspect what's confusing you is that it's the two-player zero-sum variant (using a single table) rather than independent Q-learning (which would use two separate tables). But it's really just (sample-based) value iteration, applying the two-player zero-sum identity to interpret the q-values properly. I gave a tutorial a while back that might help (check out https://www.youtube.com/watch?v=rbZBBTLH32o, around 1:02:00 to 1:19:00), which goes through the value iteration version of this algorithm; that version is probably easier to understand than the Q-learning one.
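To illustrate the backward propagation of a losing move described above, here is a tiny numeric trace assuming the usual +1/0 terminal rewards of Tic-Tac-Toe; the state and action names are made up for illustration.

```python
# Toy illustration of the propagation described above, assuming +1/0 terminal
# rewards as in Tic-Tac-Toe; state/action names are purely illustrative.
alpha, gamma = 0.5, 1.0
q = {("s_prime", "a_star"): 0.0, ("s", "a"): 0.0}

# The player to move at s' plays the winning move a*: terminal reward +1,
# no successor state to bootstrap from.
q[("s_prime", "a_star")] += alpha * (1.0 - q[("s_prime", "a_star")])

# Now back up the earlier move a from s: reward 0, and the best value at s'
# belongs to the opponent, so it enters the target with a minus sign.
best_next = q[("s_prime", "a_star")]
q[("s", "a")] += alpha * (0.0 + gamma * (-best_next) - q[("s", "a")])

print(q)  # q(s', a*) has moved toward +1; q(s, a) has moved toward -1
```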
-
Hi @lanctot, thank you for the explanation. It's clear now. It's definitely a feature, not a bug.
This was exactly my problem. I was stubbornly trying to find a link to the independent q-learning from the textbooks.
This is also key. I wasn't thinking about the single-agent problem at all. It's clear now.
Regarding the reward being always positive, I now understand what the issue in my thought process was. Off-topic: I was trying to add eligibility traces to this algorithm. It looks to me like they can be implemented for the two-player zero-sum variant without major differences from the single-player variant.
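As a rough sketch of what that might look like, here is a Watkins-style Q(lambda) layered on the shared zero-sum table. All names are invented and this is not based on the OpenSpiel code; the one wrinkle versus the single-agent case is what sign the TD error should carry when it is applied to a traced pair owned by the other player, and the per-owner flip below is only an assumption, not something I have verified.

```python
# Hypothetical sketch of a Watkins-style Q(lambda) on the shared zero-sum
# table. Names and structure are invented for illustration; this is not
# OpenSpiel's implementation, and the sign handling is an assumption.
from collections import defaultdict

q = defaultdict(float)   # (state, action) -> value from the mover's view
trace = {}               # (state, action) -> (eligibility, owning player)

def q_lambda_update(player, s, a, r, s_prime, legal_actions_prime,
                    next_player, was_greedy, alpha=0.1, gamma=1.0, lam=0.9):
    """One step of Q(lambda) with the two-player zero-sum sign convention.

    The TD error is computed from `player`'s perspective; when it is applied
    to a traced pair owned by the opponent, its sign is flipped so that every
    table entry stays in its own mover's perspective. `was_greedy` is False
    after an exploratory action, in which case traces are cut (Watkins).
    """
    if legal_actions_prime:            # non-terminal next state
        best_next = max(q[(s_prime, ap)] for ap in legal_actions_prime)
    else:                              # terminal next state
        best_next = 0.0
    sign = 1.0 if next_player == player else -1.0
    delta = r + gamma * sign * best_next - q[(s, a)]

    elig, _ = trace.get((s, a), (0.0, player))
    trace[(s, a)] = (elig + 1.0, player)            # accumulating trace

    for key, (e, owner) in list(trace.items()):
        view = 1.0 if owner == player else -1.0     # zero-sum identity
        q[key] += alpha * (view * delta) * e
        # Decay all traces, or cut them after an exploratory action.
        trace[key] = (gamma * lam * e, owner) if was_greedy else (0.0, owner)
```

Whether the per-owner sign flip is the right treatment, versus simply cutting traces whenever the mover changes, is exactly the kind of detail I still need to check.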