The output of get_gae_advantages already includes discount factors, and it is later multiplied by cum_discount, which is another discount factor. Thus, the discount factor is counted twice. I may have been misled by the formula at the bottom of page 5 of the Loaded DiCE paper (Loaded DiCE: Trading off Bias and Variance in Any-Order Score Function Estimators for Reinforcement Learning), which includes both a cumulative discount and an advantage. I took that advantage to be the GAE, which already includes discounting, and so I didn't follow the common practice of omitting the cumulative discount.
The implication is that my discount factor may be applied twice compared to common practice, so for a given discount factor (e.g. 0.96) there may be more discounting going on than you would otherwise expect. I expect that this doesn't materially change overall results, though it might affect learning dynamics, and it might be confusing or inconsistent when comparing discount factors with other codebases.
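To make the effect concrete, here is a minimal pure-Python sketch. It is not the code from POLA_dice_jax.py: the function `gae_advantages`, its signature, and the toy inputs (constant rewards, zero value estimates, lambda = 1) are assumptions chosen only for illustration. It compares the advantages on their own with the same advantages multiplied by a cumulative discount.

```python
def gae_advantages(rewards, values, next_value, gamma, lam):
    """Hypothetical GAE helper: accumulates discounted TD errors backwards in time."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        v_next = next_value if t == len(rewards) - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # TD error at step t
        gae = delta + gamma * lam * gae                  # discounting already happens here
        advantages[t] = gae
    return advantages


gamma, lam = 0.96, 1.0
rewards = [1.0] * 5   # toy data, assumed for illustration
values = [0.0] * 5    # zero baseline keeps the arithmetic easy to follow

adv = gae_advantages(rewards, values, 0.0, gamma, lam)

# Variant described in this issue: advantages times a cumulative discount gamma^t.
cum_discount = [gamma ** t for t in range(len(rewards))]
weighted = [c * a for c, a in zip(cum_discount, adv)]

for t, (a, w) in enumerate(zip(adv, weighted)):
    print(f"t={t}  advantage={a:.4f}  with cum_discount={w:.4f}")
```

In this sketch the two columns agree at t=0 and diverge afterwards, with later time steps shrunk by an extra factor of gamma^t. Dropping the multiplication by `cum_discount` corresponds to the common practice mentioned above.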
I'm leaving things as they are, even though the fix is very quick (e.g. just remove cum_discount in https://github.com/Silent-Zebra/POLA/blob/master/jax_files/POLA_dice_jax.py#L85), because I don't have time to rerun experiments now, and I don't expect results to change materially (and even if they do, I could likely use a different discount factor to get similar results).
Thanks to @cool-RR for pointing this out.