Does max_beta=0.999 in cosine schedule make any sense? #42
Comments
I also noticed that the
Has this issue been fixed?
@unrealwill sorry for my possibly trivial question, when you say:
What do you mean when you say they are trying to predict mu (as in eq. 11 from Improved Diffusion)? According to the code, it seems to me that they are predicting either the noise (epsilon) or x_start (x_0). This can be seen here (line 410 of script_util.py). Am I missing something? Does it make sense to you? Thanks,
@stsavian I was just saying that the bug manifests itself more evidently when one is trying to predict epsilon instead of x_0. But even if the code converges when predicting x_0 despite the bug, as far as I understand it should still hinder training performance, because it adds noise in an uncontrolled fashion.
@unrealwill thanks for your kind reply! Indeed, I am having some problems getting the model to converge, so I would like to explain my experiments in more detail. With my settings, predicting the noise always gives me some wrong estimations in areas with uniform backgrounds, as in #81; predicting x_0 instead seems to lead to better sampling. My data is a matrix of floats; normalizing the data so that max(x_0) = 1 seems to reduce the phenomenon when used in conjunction with predicting the target, whereas dividing by the dataset standard deviation worsens performance. Also, the loss values (simply the MSE between the target and x_t) can change a lot depending on the type of normalization. However, I find the loss values (up to a multiplying factor) not particularly indicative of the produced quality. So, all of this is to say that: Hopefully, this could be helpful for someone and help me as well.
@stsavian It seems you are in the numerical debugging phase of your development. The thing is that one bug can hide another, and it's not until you have eliminated them all that you will get good convergence. If you have convergence problems with the linear schedule, you probably have an additional bug which needs to be resolved first. For efficient bug hunting I usually like to make a sequence of increasingly complex code and datasets, starting from something as simple as possible that converges properly, and then morphing it into something more complex while maintaining convergence along the way. I find this faster than fiddling with an exponential combination of settings (but if you have infinite compute you can probably spin up a grid search to find good settings). The cosine schedule with max_beta=0.999 didn't make the cut into my code base; it smells fishy to me, and I'd advise using a different default.
@unrealwill thanks for your advice! I will make good use of it!
@stsavian @unrealwill Can I ask what conclusion you came to / what you ended up doing here? I am having the same problem. When I predict the noise using the cosine noise schedule with max beta 0.999, the magnitude of samples from the reverse process scales with the number of diffusion steps (reaching an order of magnitude in the hundreds or thousands). I don't generate sensible samples when I train the model this way (my samples look like noise); if I predict x_0 instead of epsilon, things look better. This is my first time implementing a DDPM, so I'm not sure what I'm doing wrong.
I'm curious about this. Does anyone have an answer already? Is there an error in the paper? Also, should betas be clipped at both the upper and lower bounds? Should there be a beta_min like 0? Or should betas just be clip(betas, max_beta)?
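For illustration, the two options would look something like this (toy values, not this repo's code); note that if alpha_bar is monotonically decreasing, the raw betas are already positive, so a lower bound at 0 would be a no-op and only the upper cap has any effect:

```python
import numpy as np

raw_betas = np.array([1e-4, 0.01, 0.5, 0.9999])  # toy values, not a real schedule

capped = np.minimum(raw_betas, 0.999)     # upper cap only, like the max_beta argument
clipped = np.clip(raw_betas, 0.0, 0.999)  # two-sided clipping with beta_min = 0
print(capped)   # only the last value changes, from 0.9999 to 0.999
print(clipped)  # identical to `capped` here, since the raw betas are already >= 0
```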
Refactored noise schedule logic, since custom noise schedules often define alpha_t differently, as was the case for the cosine noise schedule paper. This meant the natural place for alpha_t to be computed was now inside the noise schedule funcs (previously called beta_t and renamed because they now return both beta_t and alpha_t). Also meant changing the initialisation of the ddpm model so that it accepts alpha_t instead of calculating it inside its init. Added the cosine noise schedule defined in https://arxiv.org/pdf/2102.09672.pdf. Note: openai/guided-diffusion#42. Need to tune the s param for our image size but currently sticking with their default. Should be ok... Added a const noise schedule for the purposes of q1b.
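A minimal sketch of what that refactor might look like (hypothetical names cosine_schedule and DDPM; not the actual code from that commit, and "alpha_t" is interpreted here as the cumulative alpha_bar):

```python
import math
import numpy as np

def cosine_schedule(T, s=0.008, max_beta=0.999):
    # Cosine schedule from https://arxiv.org/pdf/2102.09672.pdf: the schedule is
    # defined through alpha_bar, so beta_t is derived from consecutive ratios.
    t = np.arange(T + 1, dtype=np.float64)
    alpha_bar = np.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    beta = np.minimum(1 - alpha_bar[1:] / alpha_bar[:-1], max_beta)
    return beta, alpha_bar[1:]

class DDPM:
    def __init__(self, betas, alpha_bars):
        # The model now accepts the precomputed schedule instead of deriving it in __init__.
        self.betas = betas
        self.alpha_bars = alpha_bars

betas, alpha_bars = cosine_schedule(1000)
model = DDPM(betas, alpha_bars)
```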
Original issue:
Hello,
guided-diffusion/guided_diffusion/gaussian_diffusion.py, lines 36 to 45 in 8fb3ad9:
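(The embedded snippet did not survive the copy; for context, the referenced helper is roughly the following, reconstructed from memory, so it may differ slightly from commit 8fb3ad9.)

```python
import math
import numpy as np

def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    # Builds a beta schedule that discretizes the given alpha_bar function,
    # which defines the cumulative product of (1 - beta) over t in [0, 1].
    # Each beta is capped at max_beta -- the 0.999 default this issue is about.
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)

# e.g. with the paper's cosine alpha_bar:
betas = betas_for_alpha_bar(1000, lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2)
```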
I get that this is a strict implementation of the paper https://arxiv.org/pdf/2102.09672.pdf, but I don't understand how max_beta=0.999 could not be a bug.
In my personal loose implementation of this paper, I had to set max_beta = 0.02, which is the end point of the linear schedule, to get working results.
In equation (13) of the Improved Diffusion paper,
mu(x_t, t) = 1/sqrt(alpha[t]) * ( x_t - beta[t] / sqrt(1 - alphabar[t]) * eps_theta(x_t, t) )
At the start of the reverse diffusion process, when t = max T, beta[T] is capped at max_beta = 0.999, so alpha[T] = 1 - beta[T] = 0.001 and 1/sqrt(alpha[T]) ≈ 31.6.
This means that variance(mu(x_t, t)) ~ 30, which means that the variance of x[t-1] ~ 30.
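A quick back-of-the-envelope check of that factor (plain Python, nothing repo-specific), comparing the repo's default cap with the 0.02 cap mentioned above:

```python
import math

for max_beta in (0.999, 0.02):
    alpha_T = 1 - max_beta           # alpha_t = 1 - beta_t at the final step
    scale = 1 / math.sqrt(alpha_T)   # the 1/sqrt(alpha[t]) factor in mu(x_t, t)
    print(f"max_beta={max_beta}: 1/sqrt(alpha_T) = {scale:.2f}")
# max_beta=0.999 -> 31.62  (the ~30x amplification described above)
# max_beta=0.02  -> 1.01
```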
In equation (9) of the paper, the neural network inputs are trained with:
x_t = sqrt(alphabar[t]) * x_0 + sqrt(1 - alphabar[t]) * eps
which is roughly of variance ~1.
This means that all the sampling will be done from samples with variance ~30, while the network has been trained on inputs with variance around 1.
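As a sanity check of the "variance ~1" claim (assuming x_0 and eps both have unit variance; the alpha_bar values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(100_000), rng.standard_normal(100_000)

for alpha_bar in (0.9999, 0.5, 1e-4):  # early, middle and late timesteps
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
    print(f"alpha_bar={alpha_bar}: var(x_t) = {xt.var():.3f}")  # stays close to 1
```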
Even if the model normalizes its input internally, this screws up the ratio of the predicted variance, and therefore the diffusion process is dominated by the first few terms, because the network will predict a variance ~30 times smaller.
In my personal loose implementation I decided to use the prediction of the noise (Ho-style) instead of the prediction of mu, as you seem to have chosen here, and therefore I am much more sensitive to this bug.
But even when predicting mu directly: if you predict mu correctly, this means you will get out of the training zone during the diffusion process (which you seem to mitigate with (dubious?) clipping), and if you predict it incorrectly because its weight is low (by sheer luck?), it just adds noise to the training process.
In the paper you explain that max_beta should be < 1 to avoid singularities, but can you clarify the reasoning for max_beta=0.999 in the range [0.02, 0.999]?
Thanks