The algorithm framework differs from the DreamerV1 paper #226
Hi @LYK-love, as you said, the algorithm you posted refers to the DreamerV1 one, which, by the way, is very similar to the one you shared here. The code you refer to is the DreamerV3 implementation, which is quite different from the V1 version, especially given the insights we have gained from looking at the authors' code.
Hi @LYK-love, in any case, I would like to answer your questions:
Perhaps what has led you astray is that in this implementation we do not have an outer loop with two loops inside (one for environment interaction and the other for training); instead, we have a single loop for environment interaction and, inside it, we check whether we need to carry out training. This choice was made to follow the original repository as closely as possible. So the structure of our code is (let me change the names of the variables for better understanding):

    # counter of policy steps played; when it reaches zero, the agent is trained
    initialize env_interaction_steps_between_trainings
    for i in total_steps:
        env.step(action)
        env_interaction_steps_between_trainings -= 1
        if train_started and env_interaction_steps_between_trainings <= 0:
            for j in per_rank_gradient_steps:
                train()
            reset env_interaction_steps_between_trainings

Another thing I noticed is that
Thanks! I'll do more work on sheeprl and try to reproduce the results of the original paper.
@michele-milesi So for each iteration of the outer for-loop over total_steps, only one step of env interaction is performed?
Yeah, at each iteration of the outer for-loop, one step of env interaction is performed.
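To make this concrete, here is a minimal, runnable sketch of the loop structure described above. The dummy environment, the train() stub, and all numbers (learning_starts, train_every, per_rank_gradient_steps) are illustrative assumptions, not the actual sheeprl code:

```python
# A minimal, runnable sketch of the single-loop structure described above.
# Everything here is a placeholder; it does not correspond one-to-one
# with the sheeprl implementation.
import random


class DummyEnv:
    def step(self, action):
        # Return a fake observation and reward.
        return random.random(), 1.0


def train():
    # Stand-in for one gradient step on world model, actor and critic.
    pass


total_steps = 1_000
learning_starts = 100          # env steps to play before training starts
train_every = 16               # env steps between two training phases
per_rank_gradient_steps = 4    # gradient steps per training phase

env = DummyEnv()
env_interaction_steps_between_trainings = train_every

for step in range(total_steps):
    # Exactly one environment interaction per outer iteration.
    env.step(action=None)
    env_interaction_steps_between_trainings -= 1

    train_started = step >= learning_starts
    if train_started and env_interaction_steps_between_trainings <= 0:
        # Training phase: several gradient steps in a row.
        for _ in range(per_rank_gradient_steps):
            train()
        # Reset the counter until the next training phase.
        env_interaction_steps_between_trainings = train_every
```

With this structure, training is interleaved with environment interaction inside a single loop, rather than being split into two nested loops as in the paper's pseudocode.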
Hello, I found the following code in sheeprl/algos/dreamer_v3.py:

Based on my understanding, the training procedure of DreamerV3 is the same as DreamerV1's, as shown in the DreamerV1 paper:
Basically, we need to:
- run a while-not-converged loop; in each iteration we:
  - run a for i in range(update_steps) loop, in each iteration doing one round of dynamics learning and one round of behavior learning;
  - then collect an episode with the current policy and add it to the replay buffer.

So, I think sheeprl's train() function is the same as one round of dynamics learning plus one round of behavior learning. It should be called update_steps times in a for loop, and that for loop should be repeated multiple times until the agent converges.

However, in the code I provided at the beginning, I didn't see the train() function being called update_steps times in a for loop, and I didn't see the outermost while loop. Meanwhile, I didn't find that, after each for loop, an episode is collected and added to the replay buffer.

I think sheeprl's implementation is a little different from the paper. Can you explain it?
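For reference, here is a minimal, runnable sketch of the DreamerV1-style loop structure described above (Algorithm 1 of the DreamerV1 paper). All function names, the replay buffer, and the numbers are placeholders, not sheeprl APIs:

```python
# A minimal, runnable sketch of the DreamerV1-style training loop described
# above. All functions are no-op placeholders and the numbers are
# illustrative; this is not sheeprl code.
def dynamics_learning(replay_buffer):
    # Update the world model on batches sampled from the replay buffer.
    pass


def behavior_learning(replay_buffer):
    # Update actor and critic on trajectories imagined by the world model.
    pass


def collect_episode():
    # Interact with the real environment using the current policy and
    # return the resulting episode.
    return []


replay_buffer = [collect_episode()]  # seed the dataset
update_steps = 100                   # "C" in the paper, an illustrative value

max_rounds = 10                      # stands in for "while not converged"
rounds = 0
while rounds < max_rounds:
    # Inner loop: update_steps rounds of dynamics + behavior learning.
    for _ in range(update_steps):
        dynamics_learning(replay_buffer)
        behavior_learning(replay_buffer)
    # Environment interaction: add one new episode to the replay buffer.
    replay_buffer.append(collect_episode())
    rounds += 1
```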
What's more, can you explain what train_step, per_rank_gradient_steps, the condition "if update >= learning_starts and updates_before_training <= 0", updates_before_training, and num_updates are? I also can't understand the logic of this piece of code.
Thanks!