
nan occurs when training ImageNet 128x128 #50

Open
ShoufaChen opened this issue Jul 12, 2022 · 25 comments

@ShoufaChen

Hi, @unixpickle

Thanks for your awesome work and open source.

I ran into a NaN issue when training on ImageNet 128x128:

-----------------------------
| lg_loss_scale | -1.62e+04 |
| loss          | nan       |
| loss_q0       | nan       |
| loss_q1       | nan       |
| loss_q2       | nan       |
| loss_q3       | nan       |
| mse           | nan       |
| mse_q0        | nan       |
| mse_q1        | nan       |
| mse_q2        | nan       |
| mse_q3        | nan       |
| samples       | 3.92e+07  |
| step          | 1.53e+05  |
| vb            | nan       |
| vb_q0         | nan       |
| vb_q1         | nan       |
| vb_q2         | nan       |
| vb_q3         | nan       |
-----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354

I used fp16. Did you encounter similar issues?

Thanks in advance.

@unixpickle
Collaborator

unixpickle commented Jul 12, 2022 via email

@ShoufaChen
Author

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.
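
For reference, here is a minimal sketch of how a training script can pick up what torch.distributed.launch --use_env exports (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). This is illustrative only; as far as I know the stock dist_util.py sets distribution up differently (via MPI), so treat it as an assumed local modification rather than the repo's code:

import os
import torch
import torch.distributed as dist

def setup_dist_from_env():
    # torch.distributed.launch --use_env exports RANK, LOCAL_RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank

Cross-node speed mostly depends on the interconnect; setting NCCL_DEBUG=INFO is a handy way to confirm which transport (InfiniBand vs. plain TCP) NCCL actually picks.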

@unixpickle
Collaborator

Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?

Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?
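
If it helps, here is one quick way to inspect the logged loss right before the divergence. It assumes the run wrote a baselines-style progress.csv into the log directory, which may not match your logging setup, so adjust the path and column names accordingly:

import pandas as pd

# Hypothetical log path; point this at your run's log directory.
df = pd.read_csv("logs/progress.csv")
# Look at the last few records before the NaNs to see whether the loss spiked.
print(df[["step", "loss", "lg_loss_scale"]].tail(50))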

@unixpickle
Collaborator

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
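
For context, here is a rough sketch of how that unscale loop fits into a dynamic loss-scaling update. It illustrates the general technique rather than the repo's exact TrainLoop; master_params and lg_loss_scale are assumed to follow the conventions in fp16_util.py:

import torch

def fp16_optimize_step(opt, master_params, lg_loss_scale):
    grads = [p.grad for p in master_params if p.grad is not None]
    # If any gradient overflowed to NaN/inf, skip the step and shrink the scale.
    if any(not torch.isfinite(g).all() for g in grads):
        return lg_loss_scale - 1
    # Un-scale the gradients of every master param (the patched loop above).
    for g in grads:
        g.mul_(1.0 / (2 ** lg_loss_scale))
    opt.step()
    # Grow the scale slowly again while steps stay finite.
    return lg_loss_scale + 0.001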

@ShoufaChen
Author

Thanks for your help.

I will apply this patch and try again, and post my results in about two days.

@realPasu

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

@ShoufaChen
Author

ShoufaChen commented Jul 16, 2022

I am now at 430,180 steps and have not encountered NaN.

@realPasu

That's strange. I'm training a 256x256 model with batch size 256 and learning rate 1e-4 on 8 nodes.
You say you didn't meet NaNs. Do you mean you no longer see NaNs at all, or that you no longer hit the problem of lg_loss_scale decreasing indefinitely even when a NaN does occur?
After applying the change, my training log still looks like the original one. My run is resumed from a partially trained model at about 300k iterations. During training I hit NaNs after a few thousand iterations, and in most cases decreasing lg_loss_scale recovers from them. But the run eventually fails after about 10-20k iterations of repeatedly decreasing lg_loss_scale, and I have to stop and start a new run from the last healthy checkpoint.

@ShoufaChen
Author

I am training a 128x128 ImageNet model.

@forever208

forever208 commented Jul 18, 2022

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44
If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

Me too: I still get NaN losses and the training fails when training on ImageNet 64x64.

@forever208

@ShoufaChen, hi, have you fully solved this issue? Have you seen any NaN losses or training failures since?

@HoJ-Onle

HoJ-Onle commented Jul 27, 2022

Hello! I also have this problem. Did you solve it? In my case the program still runs, and the loss doesn't seem to be broken yet, but it keeps printing "Found NaN".

----------------------------
| lg_loss_scale | -909     |
| loss          | 0.115    |
| loss_q0       | 0.261    |
| loss_q1       | 0.0599   |
| loss_q2       | 0.0339   |
| loss_q3       | 0.0241   |
| mse           | 0.111    |
| mse_q0        | 0.25     |
| mse_q1        | 0.0594   |
| mse_q2        | 0.0336   |
| mse_q3        | 0.0237   |
| samples       | 1.98e+03 |
| step          | 990      |
| vb            | 0.00385  |
| vb_q0         | 0.0104   |
| vb_q1         | 0.00048  |
| vb_q2         | 0.00031  |
| vb_q3         | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944

Looking forward to your reply.

@forever208

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy...

I know it's not an ideal fix, but it works for me.

@HoJ-Onle

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy...

I know it's not an ideal fix, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists.
I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening but training doesn't stop.

@ZGCTroy

ZGCTroy commented Aug 2, 2022

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy... I know it's not an ideal fix, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening but training doesn't stop.

I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how that problem is handled. However, if the program keeps finding NaNs, that means decreasing lg_loss_scale is no longer able to fix the problem.
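
One way to tell the two situations apart (a rough sketch with illustrative names, not part of the repo) is to check whether the non-finite values are only in the gradients, which a lower lg_loss_scale can usually fix, or already in the master weights themselves, which means the run has genuinely diverged:

import torch

def find_nonfinite(model):
    # NaN/inf in the weights means real divergence; NaN only in the gradients
    # usually just means the current loss scale is too high.
    bad_weights = [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]
    bad_grads = [n for n, p in model.named_parameters()
                 if p.grad is not None and not torch.isfinite(p.grad).all()]
    return bad_weights, bad_grads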

@fido20160817

fido20160817 commented Aug 17, 2022

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...

@fido20160817

fido20160817 commented Aug 17, 2022

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

Hi, how did you get multi-node, multi-GPU training working? Did you change the code? I tried multi-node, multi-GPU training with another program but failed because of slow communication between the nodes. Did you notice this, and can you share some experience with multi-node, multi-GPU training?

@forever208

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?


@fido20160817 It is normal, no need to worry about it.

@fido20160817

Thanks!🤝

@forever208

@JawnHoan Hi, if you still have this issue, I suggest decreasing the learning rate.

In my experiments on ImageNet 64x64 with batch size 128, lr=1e-4 caused this NaN issue.
I changed the learning rate from 1e-4 to 3e-5 and the problem was solved.
Hope this helps.
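
For example, keeping the flag style used earlier in this thread (illustrative values only; keep whatever batch size you are already using):

TRAIN_FLAGS="--lr 3e-5 --batch_size 128"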

@ONobody

ONobody commented Mar 1, 2023

@forever208 Hello, may I get your contact information to ask a few questions? Thank you.

@forever208

forever208 commented Mar 1, 2023

Hi @ONobody, of course, my email: [email protected]

@hxy-123-coder

Hello! I want to ask what value the loss of your trained model converges to. I trained on my own dataset, but the generated images are all noise; I can't make out any content in them at all.

@forever208

forever208 commented Jun 13, 2023

Hello! I want to ask what value the loss of your trained model converges to. I trained on my own dataset, but the generated images are all noise; I can't make out any content in them at all.

About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try CIFAR-10 or the LSUN datasets.

Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve FID and training speed on top of guided-diffusion; feel free to take a look.

@hxy-123-coder

Thanks a lot.
