I was having a look at the `build_td_lambda_targets` function in `utils/rl_utils.py`, and I was wondering whether line 8:

`ret[:, -1] = target_qs[:, -1] * (1 - th.sum(terminated, dim=1))`

is really correct.

This line should initialize the TD(lambda) targets at a given episode's last step (and the episode might not have finished yet), but shouldn't it be something like the following, as it is in the `for` loop after line 8?

`ret[:, -1] = target_qs[:, -1] * (1 - terminated[:, -1])`

That way, for each batch sample and each agent, the last time step is multiplied by its own "boolean flag" `terminated` rather than by a sum of these flags (which does not seem to make sense to me here). Am I interpreting your code wrong?
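To make the question concrete, here is a small standalone sketch (not code from the repo; the shapes just follow the `(batch, time, ...)` convention used by `build_td_lambda_targets`, and the tensors are made up) contrasting the two initializations on toy data:

```python
import torch as th

B, T, A = 3, 4, 2
target_qs = th.ones(B, T, A)          # dummy bootstrap values

terminated = th.zeros(B, T, 1)
terminated[0, 1] = 1.0   # episode 0 terminates early, before the last stored step
terminated[1, -1] = 1.0  # episode 1 terminates exactly at the last stored step
                         # episode 2 never terminates (truncated)

# Initialization currently on line 8
ret_sum = target_qs[:, -1] * (1 - th.sum(terminated, dim=1))

# Initialization proposed in this issue
ret_last = target_qs[:, -1] * (1 - terminated[:, -1])

print(ret_sum)   # zero for episodes 0 and 1, bootstrapped for episode 2
print(ret_last)  # zero only for episode 1 -- the two versions differ on episode 0
```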
For PyMARL, every episode is terminated before training, so the problem described above does not arise.
We also recommend our fine-tuned QMIX: https://github.com/hijkzzz/pymarl2.
Yes, I agree. I'm not 100% sure whether *in practice* this makes any difference to the computed results when using SMAC, but I still think that *conceptually* it is wrong (if my understanding of the code is right, of course). It could also lead to computational errors in environments other than SMAC that handle episode endings differently.
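As a purely hypothetical illustration of that last point (SMAC/PyMARL never produce such a trajectory, and the flags below are made up only to show how the two expressions diverge): if another environment or replay scheme ever stored more than one terminal flag in a single trajectory, the summed version would yield a negative weight, while the per-step flag would still be a valid 0/1 mask:

```python
import torch as th

# One stored trajectory of length 4 with two terminal flags (hypothetical)
terminated = th.zeros(1, 4, 1)
terminated[0, 1] = 1.0
terminated[0, 3] = 1.0

print(1 - th.sum(terminated, dim=1))   # tensor([[-1.]])  -> negative "not terminated" weight
print(1 - terminated[:, -1])           # tensor([[0.]])   -> still a valid 0/1 mask
```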