I'm also wondering about the signs of the terms in SMM's intrinsic reward.
Regarding pred_log_ratios: I noticed that the VAE in the original SMM implementation returns the negated log_prob (= h_s_z), and within the intrinsic reward that value is negated again.
Hence URLB's intrinsic reward is likely correct w.r.t. the sign of h_s_z, because URLB's VAE does not negate the log_prob in the first place.
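To make the sign convention concrete, here is a minimal numerical sketch. The values are made up (they are not taken from the URLB code), and latent_ent_coef is assumed to be 1.0; the point is only that when h_z and the VAE output are already the *negated* log-probs, adding them in the code is the same as subtracting the log-probs in equation 3:

```python
import math

# Hypothetical values for one state s and one skill z (illustration only).
log_p_z = math.log(1 / 4)     # uniform prior over 4 skills: log p(z)
log_q_s_z = -2.3              # VAE log-density log q(s|z), standing in for log rho_pi(s|z)

# URLB-style quantities: the "entropy" terms are the negated log-probs.
h_z = -log_p_z                # h_z = -log p(z)
pred_log_ratios = -log_q_s_z  # the VAE already returns the negated log_prob

# Adding the negated quantities, as the code does ...
reward_code = pred_log_ratios + 1.0 * h_z   # latent_ent_coef assumed 1.0 here

# ... equals subtracting the log-probs, as equation 3 does
# (the log p*(s) and log p(z|s) terms are omitted for brevity):
reward_eq3 = -log_q_s_z - log_p_z

assert math.isclose(reward_code, reward_eq3)
```

So the "+" in the code and the "−" in equation 3 are consistent once you account for where the negation happens.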
Hey,
I'm not sure if anyone can clarify this, but I wanted to check the signs in SMM's intrinsic reward:
intr_reward = pred_log_ratios + self.latent_ent_coef * h_z + self.latent_cond_ent_coef * h_z_s.detach()
The original paper in equation 3 has:
r_z(s) = log(p*(s)) - log(rho_pi(s|z)) + log(p(z|s)) - log(p(z))
Why do we add
log(rho_pi(s|z)) (== pred_log_ratios)
and log(p(z)) (whose negation is h_z, weighted by self.latent_ent_coef),
rather than subtract them as in equation 3? Sorry if this is obvious 😄