Does max_beta=0.999 in cosine schedule make any sense? #42

Open · unrealwill opened this issue Jun 23, 2022 · 9 comments

@unrealwill
unrealwill commented Jun 23, 2022

Hello,

```python
elif schedule_name == "cosine":
    return betas_for_alpha_bar(
        num_diffusion_timesteps,
        lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
    )
else:
    raise NotImplementedError(f"unknown beta schedule: {schedule_name}")


def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
```

I get this is a strict implementation of paper https://arxiv.org/pdf/2102.09672.pdf, but I don't understand how could max_beta=0.999 not be a bug.

In my personal loose implementation of this paper, I had to set max_beta = 0.02 which is the end point of the linear schedule, to get working results.

In equation (13) of the Improved Diffusion paper,

mu(x_t, t) = 1/sqrt(alpha[t]) * (x_t - beta[t] / sqrt(1 - alphabar[t]) * eps_theta(x_t, t))

At the start of the reverse diffusion process, when t = T:

x_t is Normal(0, 1),
eps_theta aims to be Normal(0, 1),
beta[t] = clipped value = 0.999,
alpha[t] = 1 - beta[t] = 0.001,
1/sqrt(alpha[t]) ~ 31.6,
alphabar[t] ~ 0 because the process has forgotten the initial x0, so
beta[t] / sqrt(1 - alphabar[t]) ~ 1.

This means that variance(mu(x_t, t)) ~ 30, which in turn means the variance of x[t-1] ~ 30.
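
For concreteness, here is a minimal sketch (re-implementing betas_for_alpha_bar the same way as gaussian_diffusion.py, with T = 1000 assumed) that reproduces these numbers:

```python
import math

import numpy as np


def betas_for_alpha_bar(num_diffusion_timesteps, alpha_bar, max_beta=0.999):
    # Same construction as the repo: beta_t from the ratio of successive
    # alpha_bar values, clipped at max_beta.
    betas = []
    for i in range(num_diffusion_timesteps):
        t1 = i / num_diffusion_timesteps
        t2 = (i + 1) / num_diffusion_timesteps
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return np.array(betas)


T = 1000
betas = betas_for_alpha_bar(
    T, lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2
)

print(betas[-1])                     # 0.999: the final beta is clipped
print(1 / math.sqrt(1 - betas[-1]))  # ~31.6: the amplification at t = T
```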

In the paper equation (9)
The neural network inputs are trained with :
xt = sqrt(alphabar[t])*x0 + sqrt(1-alphabar[t])*eps which is roughly of variance ~1

This means that all the sampling will be done from samples with variance ~30, while the network was trained on inputs with variance around 1.
Even if the model normalizes its input internally, this throws off the scale of the predicted variance, and the diffusion process becomes dominated by the first few terms, because the network will predict a variance ~30 times too small.

In my personal loose implementation I decided to predict the noise (Ho-style) instead of predicting mu (which you seem to have chosen here), and I am therefore much more sensitive to this bug.

But even when predicting mu directly: if you predict mu correctly, this means you will leave the training zone during the diffusion process (which you seem to mitigate with (dubious?) clipping), and if you predict it incorrectly because its weight is low (by sheer luck?), it just adds noise to the training process.

In the paper you explain that max_beta should be < 1 to avoid singularities, but can you clarify the reasoning for choosing max_beta = 0.999 within the range [0.02, 0.999]?

Thanks

@singwang-cn

I also noticed that x_(t-1) grows to a very large value, causing generation to fail, because 1/sqrt(alpha[t]) ~ 31.6 (both DDPM and DDIM sampling suffer from the same problem). A temporary workaround is to skip the last 20-40 time steps in generation; see the sketch below. I am still looking for a proper fix.
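
A rough sketch of that workaround; `p_sample` and `model` here are hypothetical stand-ins for one reverse-diffusion step and the trained network, not the repo's exact API:

```python
import torch

T = 1000   # total diffusion steps
skip = 30  # skip the last 20-40 (largest-t) steps, as described above


def p_sample(model, x, t):
    # Placeholder for a single reverse step (x_t -> x_{t-1}),
    # e.g. the repo's GaussianDiffusion.p_sample.
    return x


model = None                    # placeholder for the trained network
x = torch.randn(4, 3, 64, 64)   # start from pure noise

# Instead of starting at t = T - 1, start at t = T - 1 - skip, so the
# clipped beta ~ 0.999 steps are never applied.
for t in reversed(range(T - skip)):
    x = p_sample(model, x, t)
```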

@MaxxP0

MaxxP0 commented Jan 4, 2023

Has this issue been fixed?

@stsavian

stsavian commented Feb 9, 2023

@unrealwill sorry for my possibly trivial question, when you say:

In my personal loose implementation I have decided to use the prediction of the noise (Ho-style) instead of the prediction of mu as you seem to have chosen here, and therefore I am much more sensitive to this bug.

What do you mean when you say they are trying to predict mu (as in eq. 11 of the Improved Diffusion paper)? According to the code, it seems to me that they are predicting either the noise (epsilon) or x_start (x_0). This can be seen here (line 410 of script_util.py):

```python
model_mean_type=(
    gd.ModelMeanType.EPSILON if not predict_xstart else gd.ModelMeanType.START_X
),
```

Am I missing something? Does it make sense to you?

Thanks,
Stefano

@unrealwill
Author

@stsavian I was just saying that the bug manifests itself more evidently when one is trying to predict epsilon instead of x_0. But even if the code converges when predicting x_0 despite the bug, as far as I understand it should still hinder training performance, because it adds noise in an uncontrolled fashion.

@stsavian

@unrealwill thanks for your kind reply! Indeed, I am having some problems getting the model to converge. Let me explain my experiments in more detail. With my settings, I found that predicting the noise always gives me some wrong estimations in areas with uniform backgrounds, as in #81; predicting x_0 instead seems to lead to better samples.
After seeing your issue, I tried comparing the linear and cosine schedules to see whether performance changes. For me, both schedules lead to the same performance, so I wonder whether the problem really is max_beta=0.999 with the cosine schedule.
I think there might be a complex interplay between the noise schedule, the sampling steps, and the type of data.

My data is a matrix of floats. Normalizing the data (x_0) so that max(x_0) = 1 seems to reduce the phenomenon when used in conjunction with predicting the target, whereas dividing by the dataset standard deviation worsens performance. Also, the loss values (simply the MSE between the target and x_t) can change a lot depending on the type of normalization. However, I find the loss values (up to a multiplicative factor) not particularly indicative of the quality produced.

So, all of this is to say that:
i) I am having trouble understanding whether there should be a specific relationship between the input data values (x_0) and the noise added (beta);
ii) predicting the noise is supposed to be equivalent to predicting x_0, so am I stunting my model with certain hyperparameters?
iii) I am now running some experiments with extreme schedules, e.g. linear with very low (or very large) beta, and cosine with different betas; see the sketch below.
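
To make (i) and (iii) concrete, here is a small sketch (my own toy sweep, with made-up beta ranges) showing how the linear schedule's end point controls how much of the original signal alphabar keeps at t = T:

```python
import numpy as np


def linear_betas(T, beta_start, beta_end):
    # The standard DDPM linear beta schedule.
    return np.linspace(beta_start, beta_end, T)


# alpha_bar[-1] tells you how much of x_0 survives at t = T: it should be
# ~0 if the forward process is to fully destroy the signal.
for beta_end in (0.002, 0.02, 0.2):
    betas = linear_betas(1000, 1e-4, beta_end)
    alpha_bar = np.cumprod(1.0 - betas)
    print(f"beta_end={beta_end}: alpha_bar[T] = {alpha_bar[-1]:.2e}")
```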

Hopefully, this could be helpful for someone and help me as well,
Stefano

@unrealwill
Author

@stsavian It seems you are in the numerical-debugging phase of your development. The thing is that one bug can hide another, and it is not until you have eliminated them all that you will get good convergence. If you have a convergence problem with the linear schedule, you probably have an additional bug that needs to be resolved first.

For efficient bug hunting, I usually like to build a sequence of increasingly complex code and datasets, starting from something as simple as possible that converges properly, and then morphing it into something more complex while maintaining convergence along the way. I find this faster than fiddling with an exponential number of combinations of settings (but if you have infinite compute, you can probably spin up a grid search to find good settings). The cosine schedule with max_beta=0.999 didn't make the cut into my code base; it smells fishy to me, and I'd advise using a different default.
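
For example (a sketch only; 0.02 is the value that worked in my loose implementation, as mentioned above, not a validated default):

```python
import math

from guided_diffusion.gaussian_diffusion import betas_for_alpha_bar

# Same cosine alpha_bar as the repo, but with the beta cap lowered to the
# linear schedule's end point; the tail of the schedule gets clipped flat
# instead of spiking to 0.999.
betas = betas_for_alpha_bar(
    1000,
    lambda t: math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2,
    max_beta=0.02,
)
```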

@stsavian

@unrealwill thanks for your advice! I will make good use of it!

@jamesheald

@stsavian @unrealwill Can I ask what conclusion you came to, or what you ended up doing here? I am having the same problem. When I predict the noise using the cosine schedule with max_beta = 0.999, the magnitude of the samples from the reverse process scales with the number of diffusion steps (reaching orders of magnitude in the hundreds or thousands). I don't generate sensible samples when I train the model this way (my samples look like noise; if I predict x_0 instead of epsilon, things look better). This is my first time implementing a DDPM, so I'm not sure what I'm doing wrong.

@joaolcguerreiro

joaolcguerreiro commented Aug 26, 2023

I'm curious about this. Has anyone found an answer yet? Is there an error in the paper?

Also, should betas be clipped at both the upper and lower bounds? Should there be a beta_min, such as 0? Or should it just be clip(betas, max_beta)?
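
For reference, betas_for_alpha_bar only caps the upper end, via min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta). A two-sided clip, if one wanted it, might look like the sketch below (toy values; note that the raw cosine betas are already positive, so the lower bound would be a no-op there):

```python
import numpy as np

max_beta = 0.999
betas = np.array([1e-4, 0.5, 1.2])  # toy values; the last is out of range

# Bound betas on both sides: beta_min = 0 keeps them non-negative, while
# max_beta caps the top end as in the repo.
betas = np.clip(betas, 0.0, max_beta)
print(betas)  # [1.e-04 5.e-01 9.99e-01]
```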

VishalJ99 added a commit to VishalJ99/m2_cw that referenced this issue Mar 18, 2024
Refactored noise schedule logic, since custom noise schedules often define
alpha_t differently, as was the case for the cosine noise schedule paper.

This meant the natural place for alpha_t to be computed was now inside
the noise schedule funcs (previously called beta_t and renamed because they now
return beta_t and alpha_t).

Also meant changing the initialisation of the ddpm model so that it could accept
alpha_t instead of calculating it inside its init.

Added cosine noise schedule defined in: https://arxiv.org/pdf/2102.09672.pdf.
Note: openai/guided-diffusion#42.

Need to tune the s param for our image size but currently sticking with their default.
Should be ok...

Added const noise schedule for the purposes of q1b.
VishalJ99 added a commit to VishalJ99/diffusion_model_cw that referenced this issue Mar 29, 2024, with the same commit message as above.