Does max_beta=0.999 in cosine schedule make any sense? #42

Open · unrealwill opened this issue Jun 23, 2022 · 9 comments

@unrealwill
unrealwill commented Jun 23, 2022

Hello,

```python
elif schedule_name == "cosine":
    return betas_for_alpha_bar(
        num_diffusion_timesteps,
        lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
    )
else:
    raise NotImplementedError(f"unknown beta schedule: {schedule_name}")


def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
```

I get this is a strict implementation of paper https://arxiv.org/pdf/2102.09672.pdf, but I don't understand how could max_beta=0.999 not be a bug.

In my personal loose implementation of this paper, I had to set max_beta = 0.02 which is the end point of the linear schedule, to get working results.

In equation (13) of the Improved Diffusion paper,

mu(x_t, t) = 1/sqrt(alpha[t]) * (x_t - beta[t] / sqrt(1 - alphabar[t]) * eps_theta(x_t, t))

At the start of the reverse diffusion process, when t = T:

x_t is Normal(0, 1),
eps_theta aims to be Normal(0, 1),
beta[t] = clipped value = 0.999,
alpha[t] = 1 - beta[t] = 0.001,
1/sqrt(alpha[t]) ~ 31.6,
alphabar[t] ~ 0 because the process has forgotten the initial x0, so
beta[t] / sqrt(1 - alphabar[t]) ~ 1.

This means that variance(mu(x_t, t)) ~ 30, which in turn means the variance of x[t-1] ~ 30.
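
For concreteness, here is a minimal sketch (re-implementing betas_for_alpha_bar the same way as gaussian_diffusion.py, with T = 1000 assumed) that reproduces these numbers:

```python
import math

import numpy as np


def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    # Same construction as the repo: beta_t from the ratio of successive
    # alpha_bar values, clipped at max_beta.
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)


T = 1000
betas = betas_for_alpha_bar(
    T, lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
)

print(betas[-1])                     # 0.999: the final beta is clipped
print(1 / math.sqrt(1 - betas[-1]))  # ~31.6: the amplification at t = T
```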

In the paper equation (9)
The neural network inputs are trained with :
xt = sqrt(alphabar[t])*x0 + sqrt(1-alphabar[t])*eps which is roughly of variance ~1

This means that all the sampling will be done from samples with variance ~30, while the network was trained on inputs with variance around 1.
Even if the model normalizes its input internally, this throws off the scale of the predicted variance, and the diffusion process becomes dominated by the first few terms, because the network will predict a variance ~30 times too small.

In my personal loose implementation I decided to predict the noise (Ho-style) instead of predicting mu (which you seem to have chosen here), and I am therefore much more sensitive to this bug.

But even when predicting mu directly: if you predict mu correctly, this means you will leave the training zone during the diffusion process (which you seem to mitigate with (dubious?) clipping), and if you predict it incorrectly because its weight is low (by sheer luck?), it just adds noise to the training process.

In the paper you explain that max_beta should be < 1 to avoid singularities, but can you clarify the reasoning for choosing max_beta = 0.999 within the range [0.02, 0.999]?

Thanks

@singwang-cn

I also noticed that x_(t-1) grows to a very large value, causing generation to fail, because 1/sqrt(alpha[t]) ~ 31.6 (both DDPM and DDIM sampling suffer from the same problem). A temporary workaround is to skip the last 20-40 time steps in generation; see the sketch below. I am still looking for a proper fix.
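
A rough sketch of that workaround; `p_sample` and `model` here are hypothetical stand-ins for one reverse-diffusion step and the trained network, not the repo's exact API:

```python
import torch

T = 1000   # total diffusion steps
skip = 30  # skip the last 20-40 (largest-t) steps, as described above


def p_sample(model, x, t):
    # Placeholder for a single reverse step (x_t -> x_{t-1}),
    # e.g. the repo's GaussianDiffusion.p_sample.
    return x


model = None                    # placeholder for the trained network
x = torch.randn(4, 3, 64, 64)   # start from pure noise

# Instead of starting at t = T - 1, start at t = T - 1 - skip, so the
# clipped beta ~ 0.999 steps are never applied.
for t in reversed(range(T - skip)):
    x = p_sample(model, x, t)
```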

@MaxxP0

MaxxP0 commented Jan 4, 2023

Has this issue been fixed?

@stsavian

stsavian commented Feb 9, 2023

@unrealwill sorry for my possibly trivial question, when you say:

In my personal loose implementation I have decided to use the prediction of the noise (Ho-style) instead of the prediction of mu as you seem to have chosen here, and therefore I am much more sensitive to this bug.

What do you mean when you say they are trying to predict mu (as in eq. 11 of the Improved Diffusion paper)? According to the code, it seems to me that they are predicting either the noise (epsilon) or x_start (x_0). This can be seen here (line 410 of script_util.py):

```python
model_mean_type=(
    gd.ModelMeanType.EPSILON if not predict_xstart else gd.ModelMeanType.START_X
),
```

Am I missing something? Does it make sense to you?

Thanks,
Stefano

@unrealwill
Author

@stsavian I was just saying that the bug manifests itself more evidently when one is trying to predict epsilon instead of x_0. But even if the code converges when predicting x_0 despite the bug, as far as I understand it should still hinder training performance, because it adds noise in an uncontrolled fashion.

@stsavian

@unrealwill thanks for your kind reply! Indeed, I am having some problems getting the model to converge. Let me explain my experiments in more detail. With my settings, I found that predicting the noise always gives me some wrong estimations in areas with uniform backgrounds, as in #81; predicting x_0 instead seems to lead to better samples.
After seeing your issue, I tried comparing the linear and cosine schedules to see whether performance changes. For me, both schedules lead to the same performance, so I wonder whether the problem really is max_beta=0.999 with the cosine schedule.
I think there might be a complex interplay between the noise schedule, the sampling steps, and the type of data.

My data is a matrix of floats. Normalizing the data (x_0) so that max(x_0) = 1 seems to reduce the phenomenon when used in conjunction with predicting the target, whereas dividing by the dataset standard deviation worsens performance. Also, the loss values (simply the MSE between the target and x_t) can change a lot depending on the type of normalization. However, I find the loss values (up to a multiplicative factor) not particularly indicative of the quality produced.

So, all of this is to say that:
i) I am having trouble understanding whether there should be a specific relationship between the input data values (x_0) and the noise added (beta);
ii) predicting the noise is supposed to be equivalent to predicting x_0, so am I stunting my model with certain hyperparameters?
iii) I am now running some experiments with extreme schedules, e.g. linear with very low (or very large) beta, and cosine with different betas; see the sketch below.
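
To make (i) and (iii) concrete, here is a small sketch (my own toy sweep, with made-up beta ranges) showing how the linear schedule's end point controls how much of the original signal alphabar keeps at t = T:

```python
import numpy as np


def linear_betas(T, beta_start, beta_end):
    # The standard DDPM linear beta schedule.
    return np.linspace(beta_start, beta_end, T)


# alpha_bar[-1] tells you how much of x_0 survives at t = T: it should be
# ~0 if the forward process is to fully destroy the signal.
for beta_end in (0.002, 0.02, 0.2):
    betas = linear_betas(1000, 1e-4, beta_end)
    alpha_bar = np.cumprod(1.0 - betas)
    print(f"beta_end={beta_end}: alpha_bar[T] = {alpha_bar[-1]:.2e}")
```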

Hopefully, this could be helpful for someone and help me as well,
Stefano

@unrealwill
Author

@stsavian It seems you are in the numerical-debugging phase of your development. The thing is that one bug can hide another, and it is not until you have eliminated them all that you will get good convergence. If you have a convergence problem with the linear schedule, you probably have an additional bug that needs to be resolved first.

For efficient bug hunting, I usually like to build a sequence of increasingly complex code and datasets, starting from something as simple as possible that converges properly, and then morphing it into something more complex while maintaining convergence along the way. I find this faster than fiddling with an exponential number of combinations of settings (but if you have infinite compute, you can probably spin up a grid search to find good settings). The cosine schedule with max_beta=0.999 didn't make the cut into my code base; it smells fishy to me, and I'd advise using a different default.
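
For example (a sketch only; 0.02 is the value that worked in my loose implementation, as mentioned above, not a validated default):

```python
import math

from guided_diffusion.gaussian_diffusion import betas_for_alpha_bar

# Same cosine alpha_bar as the repo, but with the beta cap lowered to the
# linear schedule's end point; the tail of the schedule gets clipped flat
# instead of spiking to 0.999.
betas = betas_for_alpha_bar(
    1000,
    lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
    max_beta=0.02,
)
```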

@stsavian

@unrealwill thanks for your advice! I will make good use of it!

@jamesheald

@stsavian @unrealwill Can I ask what conclusion you came to, or what you ended up doing here? I am having the same problem. When I predict the noise using the cosine schedule with max_beta = 0.999, the magnitude of the samples from the reverse process scales with the number of diffusion steps (reaching orders of magnitude in the hundreds or thousands). I don't generate sensible samples when I train the model this way (my samples look like noise; if I predict x_0 instead of epsilon, things look better). This is my first time implementing a DDPM, so I'm not sure what I'm doing wrong.

@joaolcguerreiro

joaolcguerreiro commented Aug 26, 2023

I'm curious about this. Has anyone found an answer yet? Is there an error in the paper?

Also, should betas be clipped at both the upper and lower bounds? Should there be a beta_min, such as 0? Or should it just be clip(betas, max_beta)?
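
For reference, betas_for_alpha_bar only caps the upper end, via min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta). A two-sided clip, if one wanted it, might look like the sketch below (toy values; note that the raw cosine betas are already positive, so the lower bound would be a no-op there):

```python
import numpy as np

max_beta = 0.999
betas = np.array([1e-4, 0.5, 1.2])  # toy values; the last is out of range

# Bound betas on both sides: beta_min = 0 keeps them non-negative, while
# max_beta caps the top end as in the repo.
betas = np.clip(betas, 0.0, max_beta)
print(betas)  # [1.e-04 5.e-01 9.99e-01]
```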

VishalJ99 added a commit to VishalJ99/m2_cw that referenced this issue Mar 18, 2024
Refactored noise schedule logic, since custom noise schedules often define
alpha_t differently, as was the case for the cosine noise schedule paper.

This meant the natural place for alpha_t to be computed was now inside
the noise schedule funcs (previously called beta_t and renamed because they now
return beta_t and alpha_t).

Also meant changing the initialisation of the ddpm model so that it could accept
alpha_t instead of calculating it inside its init.

Added cosine noise schedule defined in: https://arxiv.org/pdf/2102.09672.pdf.
Note: openai/guided-diffusion#42.

Need to tune the s param for our image size but currently sticking with their default.
Should be ok...

Added const noise schedule for the purposes of q1b.
VishalJ99 added a commit to VishalJ99/diffusion_model_cw that referenced this issue Mar 29, 2024, with the same commit message as above.