-
Hello OpenSpiel team, I think there might be a bug in the tabular Q-learning algorithm. In the function, I suppose this line is used to alternate negative/positive q-values based on the current player (this also seems to be suggested by the comment above that line of code). Though, it looks like the value from …

I have noticed something else: in games like Tic-Tac-Toe, we're never going to have negative rewards in the code linked here. This is because the player taking the last action can only end up in a draw or a victory, after which the game ends. Is this going to affect the updates of the q-values somehow? I have a feeling that the logic above, where we change the sign of the next q-value, compensates for this, but I'm having a hard time verifying it.

Wouldn't it be more appropriate to update the Q-table only when a specified player is taking actions, and to find a way to add the negative reward when the game is lost? Apologies if this is exactly what is being done already. I'm still doing some reading on Q-learning in two-player zero-sum games (links below if people are interested) to understand how tabular Q-learning works in that setting.
-
Hi @giogix2,

I'm having a hard time understanding where the bug could be, so I'll follow up with an explanation. Maybe you can try to come up with an example where you think the update rule is wrong?

The first important thing to note is that this is a "two-player zero-sum" specialized variant of Q-learning (specifically, it is different from two independent Q-learners playing against each other -- that's critical).
The q-values, e.g. q(s,a), stored in the table are always from the point of view of whose turn it is at s. So, when you call GetBestActionValue(s'), it returns the value of the best action for the player to move at s', which is why you have to do the sign check in the code. You're right that in a strictly alternating game like Tic-Tac-Toe this sign will always be -1, and that is intended. But in a gridworld (a single-agent problem), it will always be +1, and in a game where the same player can move several times in a row (e.g. backgammon), the update rule is correct there too. The Q-learning update rule requires that the Q-values always be interpreted from the same perspective, so it just applies the two-player zero-sum identity q_0(s,a) = -q_1(s,a).
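To make that concrete, here is a minimal Python sketch of what an update with a single shared table could look like. This is not OpenSpiel's actual (C++) implementation; the names `td_update`, `same_player`, and `legal_actions_prime` are invented for illustration.

```python
# Minimal sketch (not OpenSpiel code): a single shared Q-table where
# q[(s, a)] is stored from the perspective of the player to move at s.
from collections import defaultdict

q = defaultdict(float)  # hypothetical table: (state, action) -> value

def td_update(s, a, r, s_prime, legal_actions_prime, same_player,
              alpha=0.1, gamma=1.0):
    """One tabular update using the two-player zero-sum identity.

    `same_player` is True when the mover at s_prime is also the mover at s
    (a gridworld, or a repeated move in a game like backgammon); otherwise
    the best value at s_prime belongs to the opponent and is negated.
    """
    if legal_actions_prime:   # non-terminal next state
        best_next = max(q[(s_prime, ap)] for ap in legal_actions_prime)
    else:                     # terminal next state
        best_next = 0.0
    sign = 1.0 if same_player else -1.0
    target = r + gamma * sign * best_next
    q[(s, a)] += alpha * (target - q[(s, a)])
```

The only two-player-specific piece is the `sign` factor: it is the zero-sum identity applied when backing up a value that belongs to the other player.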
This is also the correct interpretation because the game can only end in a win or a draw, and it is precisely why you have to use the identity above. A move a from s will only start to be recognized as a losing move when the player that follows (at s') recognizes the win from s' (say via a*), and then its q(s', a*) will become positive and converge to 1. a* will be selected in GetBestActionValue() along with its value at s' (1), and that value will get inverted at s. So q(s,a) will then always receive targets of -1, which is correct since a is a losing move.

I suspect what's confusing you is that it's the two-player zero-sum variant (using a single table) rather than independent Q-learning (which would use two separate tables). But it's really just (sample-based) value iteration, applying the two-player zero-sum identity to interpret the q-values properly. I gave a tutorial a while back that might help (check out https://www.youtube.com/watch?v=rbZBBTLH32o, around 1:02:00 to 1:19:00), which goes through the value iteration version of this algorithm; that version is probably easier to understand than the Q-learning one.
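To illustrate the backward propagation of a losing move described above, here is a tiny numeric trace assuming the usual +1/0 terminal rewards of Tic-Tac-Toe; the state and action names are made up for illustration.

```python
# Toy illustration of the propagation described above, assuming +1/0 terminal
# rewards as in Tic-Tac-Toe; state/action names are purely illustrative.
alpha, gamma = 0.5, 1.0
q = {("s_prime", "a_star"): 0.0, ("s", "a"): 0.0}

# The player to move at s' plays the winning move a*: terminal reward +1,
# no successor state to bootstrap from.
q[("s_prime", "a_star")] += alpha * (1.0 - q[("s_prime", "a_star")])

# Now back up the earlier move a from s: reward 0, and the best value at s'
# belongs to the opponent, so it enters the target with a minus sign.
best_next = q[("s_prime", "a_star")]
q[("s", "a")] += alpha * (0.0 + gamma * (-best_next) - q[("s", "a")])

print(q)  # q(s', a*) has moved toward +1; q(s, a) has moved toward -1
```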
-
Hi @lanctot, thank you for the explanation. It's clear now. It's definitely a feature, not a bug.
This was exactly my problem. I was stubbornly trying to find a link to the independent q-learning from the textbooks.
This is also key. I wasn't thinking about the single-agent problem at all. It's clear now.
Regarding the reward being always positive, I now understand what the issue in my thought process was. Off-topic: I was trying to add eligibility traces to this algorithm. It looks to me like they can be implemented for the two-player zero-sum variant without major differences from the single-player variant.
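As a rough sketch of what that might look like, here is a Watkins-style Q(lambda) layered on the shared zero-sum table. All names are invented and this is not based on the OpenSpiel code; the one wrinkle versus the single-agent case is what sign the TD error should carry when it is applied to a traced pair owned by the other player, and the per-owner flip below is only an assumption, not something I have verified.

```python
# Hypothetical sketch of a Watkins-style Q(lambda) on the shared zero-sum
# table. Names and structure are invented for illustration; this is not
# OpenSpiel's implementation, and the sign handling is an assumption.
from collections import defaultdict

q = defaultdict(float)   # (state, action) -> value from the mover's view
trace = {}               # (state, action) -> (eligibility, owning player)

def q_lambda_update(player, s, a, r, s_prime, legal_actions_prime,
                    next_player, was_greedy, alpha=0.1, gamma=1.0, lam=0.9):
    """One step of Q(lambda) with the two-player zero-sum sign convention.

    The TD error is computed from `player`'s perspective; when it is applied
    to a traced pair owned by the opponent, its sign is flipped so that every
    table entry stays in its own mover's perspective. `was_greedy` is False
    after an exploratory action, in which case traces are cut (Watkins).
    """
    if legal_actions_prime:            # non-terminal next state
        best_next = max(q[(s_prime, ap)] for ap in legal_actions_prime)
    else:                              # terminal next state
        best_next = 0.0
    sign = 1.0 if next_player == player else -1.0
    delta = r + gamma * sign * best_next - q[(s, a)]

    elig, _ = trace.get((s, a), (0.0, player))
    trace[(s, a)] = (elig + 1.0, player)            # accumulating trace

    for key, (e, owner) in list(trace.items()):
        view = 1.0 if owner == player else -1.0     # zero-sum identity
        q[key] += alpha * (view * delta) * e
        # Decay all traces, or cut them after an exploratory action.
        trace[key] = (gamma * lam * e, owner) if was_greedy else (0.0, owner)
```

Whether the per-owner sign flip is the right treatment, versus simply cutting traces whenever the mover changes, is exactly the kind of detail I still need to check.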