I was having a look at the `build_td_lambda_targets` function in `utils/rl_utils.py`, and I was wondering whether line 8:

`ret[:, -1] = target_qs[:, -1] * (1 - th.sum(terminated, dim=1))`

is really correct.

This line should initialize the TD(lambda) targets at a given episode's last step (and the episode might not have finished yet), but shouldn't it be something like the following, as it is in the `for` loop after line 8?

`ret[:, -1] = target_qs[:, -1] * (1 - terminated[:, -1])`

That way, for each batch sample and each agent, the last time step is multiplied by its own "boolean flag" `terminated` rather than by a sum of these flags (which does not seem to make sense to me here). Am I interpreting your code wrong?
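To make the question concrete, here is a small standalone sketch (not code from the repo; the shapes just follow the `(batch, time, ...)` convention used by `build_td_lambda_targets`, and the tensors are made up) contrasting the two initializations on toy data:

```python
import torch as th

B, T, A = 3, 4, 2
target_qs = th.ones(B, T, A)          # dummy bootstrap values

terminated = th.zeros(B, T, 1)
terminated[0, 1] = 1.0   # episode 0 terminates early, before the last stored step
terminated[1, -1] = 1.0  # episode 1 terminates exactly at the last stored step
                         # episode 2 never terminates (truncated)

# Initialization currently on line 8
ret_sum = target_qs[:, -1] * (1 - th.sum(terminated, dim=1))

# Initialization proposed in this issue
ret_last = target_qs[:, -1] * (1 - terminated[:, -1])

print(ret_sum)   # zero for episodes 0 and 1, bootstrapped for episode 2
print(ret_last)  # zero only for episode 1 -- the two versions differ on episode 0
```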
For PyMARL, every episode is terminated before training, so the problem described above does not arise.
We also recommend our fine-tuned QMIX: https://github.com/hijkzzz/pymarl2.
Yes, I agree. I'm not 100% sure whether *in practice* this makes any difference to the computed results when using SMAC, but I still think that *conceptually* it is wrong (if my understanding of the code is right, of course). It could also lead to computational errors in environments other than SMAC that handle episode endings differently.
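As a purely hypothetical illustration of that last point (SMAC/PyMARL never produce such a trajectory, and the flags below are made up only to show how the two expressions diverge): if another environment or replay scheme ever stored more than one terminal flag in a single trajectory, the summed version would yield a negative weight, while the per-step flag would still be a valid 0/1 mask:

```python
import torch as th

# One stored trajectory of length 4 with two terminal flags (hypothetical)
terminated = th.zeros(1, 4, 1)
terminated[0, 1] = 1.0
terminated[0, 3] = 1.0

print(1 - th.sum(terminated, dim=1))   # tensor([[-1.]])  -> negative "not terminated" weight
print(1 - terminated[:, -1])           # tensor([[0.]])   -> still a valid 0/1 mask
```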