Gridworld constants and reward size fix #103

casper2002casper · 2022-03-02T12:29:21Z

Gridworld stores its constants as part of its environment instead of declaring them as constants, which differs from the other examples.
It also uses rewards in the range [-10, 10], which creates two problems. First, rewards of that size overpower the exploration term of the UCT formula. Second, the value output of the SimpleNet network is limited to [-1, 1] by its tanh activation function, so it cannot output the required values.
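To illustrate both problems, here is a minimal sketch in Python. It assumes a plain UCB1-style score (mean value plus an exploration bonus) rather than the exact formula used by the library, and every name in it (`uct_score`, `best_action`, `rescale_reward`, `c_explore`) is hypothetical and not taken from this PR.

```python
import math

def uct_score(q_value, parent_visits, child_visits, c_explore=1.0):
    """UCB1-style score: mean value plus an exploration bonus (assumed form)."""
    bonus = c_explore * math.sqrt(math.log(parent_visits) / child_visits)
    return q_value + bonus

def best_action(q_values, visits, c_explore=1.0):
    """Return the index of the action with the highest score."""
    total = sum(visits)
    scores = [uct_score(q, total, n, c_explore) for q, n in zip(q_values, visits)]
    return max(range(len(scores)), key=scores.__getitem__)

def rescale_reward(reward, max_abs_reward=10.0):
    """Map a reward from [-max_abs_reward, max_abs_reward] to [-1, 1]."""
    return reward / max_abs_reward

# Action 0 is well explored; action 1 has barely been tried.
visits = [50, 2]
raw_q = [6.0, 3.0]                              # mean values on the [-10, 10] scale
scaled_q = [rescale_reward(q) for q in raw_q]   # same values on the [-1, 1] scale

# On the raw scale the value gap swamps the bonus, so the search keeps exploiting.
raw_choice = best_action(raw_q, visits)
# On the rescaled values the bonus for the rarely visited action matters again.
scaled_choice = best_action(scaled_q, visits)

# A tanh output can never reach a target such as 10, whatever the pre-activation is.
largest_tanh_output = math.tanh(100.0)
```

With these numbers the raw rewards select the well-explored action and the rescaled rewards select the rarely visited one. Raising the exploration constant would be another way to restore the balance, but it would not help with the tanh bound.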
Before

After

The change gives about 10% better reward for AlphaZero (AZ). However, the NN doesn't improve further, as the maximum reward is reached after the second iteration.

jonathan-laurent · 2022-03-02T15:03:23Z

Thanks! Do you have any idea why the gap is wider between the neural network and AlphaZero in the second case?

casper2002casper · 2022-03-02T15:07:36Z

I think it's because the NN doesn't get replaced once the AZ benchmark maxes out.

jonathan-laurent · 2022-03-02T15:13:33Z

Got it. This makes sense!

casper2002casper added 2 commits March 2, 2022 13:14

Use constants

9acdb28

Use correct reward size

a4a8cf9
