
nan occurs when training ImageNet 128x128 #50

Open
ShoufaChen opened this issue Jul 12, 2022 · 25 comments

@ShoufaChen

Hi, @unixpickle

Thanks for your awesome work and open source.

I ran into a NaN issue when training on ImageNet 128x128:

-----------------------------
| lg_loss_scale | -1.62e+04 |
| loss          | nan       |
| loss_q0       | nan       |
| loss_q1       | nan       |
| loss_q2       | nan       |
| loss_q3       | nan       |
| mse           | nan       |
| mse_q0        | nan       |
| mse_q1        | nan       |
| mse_q2        | nan       |
| mse_q3        | nan       |
| samples       | 3.92e+07  |
| step          | 1.53e+05  |
| vb            | nan       |
| vb_q0         | nan       |
| vb_q1         | nan       |
| vb_q2         | nan       |
| vb_q3         | nan       |
-----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354

I used fp16. Did you encounter similar issues?

Thanks in advance.

@unixpickle
Collaborator

unixpickle commented Jul 12, 2022 via email

@ShoufaChen
Author

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.
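
For reference, here is a minimal sketch of how a training script can pick up what torch.distributed.launch --use_env exports (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT). This is illustrative only; as far as I know the stock dist_util.py sets distribution up differently (via MPI), so treat it as an assumed local modification rather than the repo's code:

import os
import torch
import torch.distributed as dist

def setup_dist_from_env():
    # torch.distributed.launch --use_env exports RANK, LOCAL_RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank

Cross-node speed mostly depends on the interconnect; setting NCCL_DEBUG=INFO is a handy way to confirm which transport (InfiniBand vs. plain TCP) NCCL actually picks.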

@unixpickle
Collaborator

Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening?

Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?
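
If it helps, here is one quick way to inspect the logged loss right before the divergence. It assumes the run wrote a baselines-style progress.csv into the log directory, which may not match your logging setup, so adjust the path and column names accordingly:

import pandas as pd

# Hypothetical log path; point this at your run's log directory.
df = pd.read_csv("logs/progress.csv")
# Look at the last few records before the NaNs to see whether the loss spiked.
print(df[["step", "loss", "lg_loss_scale"]].tail(50))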

@unixpickle
Collaborator

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
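
For context, here is a rough sketch of how that unscale loop fits into a dynamic loss-scaling update. It illustrates the general technique rather than the repo's exact TrainLoop; master_params and lg_loss_scale are assumed to follow the conventions in fp16_util.py:

import torch

def fp16_optimize_step(opt, master_params, lg_loss_scale):
    grads = [p.grad for p in master_params if p.grad is not None]
    # If any gradient overflowed to NaN/inf, skip the step and shrink the scale.
    if any(not torch.isfinite(g).all() for g in grads):
        return lg_loss_scale - 1
    # Un-scale the gradients of every master param (the patched loop above).
    for g in grads:
        g.mul_(1.0 / (2 ** lg_loss_scale))
    opt.step()
    # Grow the scale slowly again while steps stay finite.
    return lg_loss_scale + 0.001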

@ShoufaChen
Author

Thanks for your help.

I will apply this patch and try again, and post my results in about two days.

@realPasu

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44

If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

@ShoufaChen
Author

ShoufaChen commented Jul 16, 2022

I am now at 430,180 steps and have not encountered NaN.

@realPasu

That's strange. I'm training a 256x256 model with batch size 256 and learning rate 1e-4 on 8 nodes.
You say you didn't meet NaNs. Do you mean you no longer see NaNs at all, or that you no longer hit the problem of lg_loss_scale decreasing indefinitely even when a NaN does occur?
After applying the change, my training log still looks like the original one. My run is resumed from a partially trained model at about 300k iterations. During training I hit NaNs after a few thousand iterations, and in most cases decreasing lg_loss_scale recovers from them. But the run eventually fails after about 10-20k iterations of repeatedly decreasing lg_loss_scale, and I have to stop and start a new run from the last healthy checkpoint.

@ShoufaChen
Author

I am training a 128x128 ImageNet model.

@forever208

forever208 commented Jul 18, 2022

The problem of NaNs still exists with this change.

Perhaps this bug is related to the issue here: #44
If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))

Me too: I still get NaN losses and the training fails when training on ImageNet 64x64.

@forever208

@ShoufaChen, hi, have you fully solved this issue? Have you seen any NaN losses or training failures since?

@HoJ-Onle

HoJ-Onle commented Jul 27, 2022

Hello! I also have this problem. Did you solve it? In my case the program still runs, and the loss doesn't seem to be broken yet, but it keeps printing "Found NaN".

----------------------------
| lg_loss_scale | -909     |
| loss          | 0.115    |
| loss_q0       | 0.261    |
| loss_q1       | 0.0599   |
| loss_q2       | 0.0339   |
| loss_q3       | 0.0241   |
| mse           | 0.111    |
| mse_q0        | 0.25     |
| mse_q1        | 0.0594   |
| mse_q2        | 0.0336   |
| mse_q3        | 0.0237   |
| samples       | 1.98e+03 |
| step          | 990      |
| vb            | 0.00385  |
| vb_q0         | 0.0104   |
| vb_q1         | 0.00048  |
| vb_q2         | 0.00031  |
| vb_q3         | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944

Looking forward to your reply.

@forever208

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy...

I know it's not an ideal fix, but it works for me.

@HoJ-Onle

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy...

I know it's not an ideal fix, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists.
I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening but training doesn't stop.

@ZGCTroy

ZGCTroy commented Aug 2, 2022

@JawnHoan My solution was to re-clone the whole repo and re-implement my own method on top of the fresh copy... I know it's not an ideal fix, but it works for me.

Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists. I'm also curious whether the results will be wrong if "Found NaN, decreased lg_loss_scale to xx" keeps happening but training doesn't stop.

I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how that problem is handled. However, if the program keeps finding NaNs, that means decreasing lg_loss_scale is no longer able to fix the problem.
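
One way to tell the two situations apart (a rough sketch with illustrative names, not part of the repo) is to check whether the non-finite values are only in the gradients, which a lower lg_loss_scale can usually fix, or already in the master weights themselves, which means the run has genuinely diverged:

import torch

def find_nonfinite(model):
    # NaN/inf in the weights means real divergence; NaN only in the gradients
    # usually just means the current loss scale is too high.
    bad_weights = [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]
    bad_grads = [n for n, p in model.named_parameters()
                 if p.grad is not None and not torch.isfinite(p.grad).all()]
    return bad_weights, bad_grads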

@fido20160817

fido20160817 commented Aug 17, 2022

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?

----------------------------
| grad_norm     | 0.144    |
| lg_loss_scale | 23.3     |
| loss          | 0.185    |
| loss_q0       | 0.285    |
| loss_q1       | 0.0296   |
| loss_q2       | 0.0139   |
| loss_q3       | 0.44     |
| mse           | 0.0367   |
| mse_q0        | 0.147    |
| mse_q1        | 0.029    |
| mse_q2        | 0.0136   |
| mse_q3        | 0.00291  |
| param_norm    | 303      |
| samples       | 2.62e+04 |
| step          | 3.27e+03 |
| vb            | 0.148    |
| vb_q0         | 0.138    |
| vb_q1         | 0.000615 |
| vb_q2         | 0.000278 |
| vb_q3         | 0.437    |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm     | 0.13     |
| lg_loss_scale | 23.6     |
| loss          | 0.0725   |
| loss_q0       | 0.205    |
| loss_q1       | 0.0294   |
| loss_q2       | 0.0108   |
| loss_q3       | 0.00471  |
| mse           | 0.0481   |
| mse_q0        | 0.127    |
| mse_q1        | 0.0288   |
| mse_q2        | 0.0105   |
| mse_q3        | 0.00452  |
| param_norm    | 307      |
| samples       | 3.71e+04 |
| step          | 4.64e+03 |
| vb            | 0.0245   |
| vb_q0         | 0.0776   |
| vb_q1         | 0.00059  |
| vb_q2         | 0.00021  |
| vb_q3         | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...

@fido20160817

fido20160817 commented Aug 17, 2022

Hi, @unixpickle

Thanks for your help.

My command:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 128 --learn_sigma True --num_channels 256 --num_heads 4 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 8"

OMP_NUM_THREADS=1 NCCL_IB_GID_INDEX=3 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=4 --node_rank=$1 \
    --master_addr=$CHIEF_IP --master_port=22268 \
    --use_env scripts/image_train.py \
    --data_dir /dev/shm/imagenet/train \
    $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

I use 4 nodes, each of which has 8 GPUs.

Hi, how did you get multi-node, multi-GPU training working? Did you change the code? I tried multi-node, multi-GPU training with another program but failed because of slow communication between the nodes. Did you notice this, and can you share some experience with multi-node, multi-GPU training?

@forever208

Is it normal for NaNs to occur at intervals? When a NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, a NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?


@fido20160817 It is normal, no need to worry about it.

@fido20160817

Thanks!🤝

@forever208

@JawnHoan Hi, if you still have this issue, I suggest decreasing the learning rate.

In my experiments on ImageNet 64x64 with batch size 128, lr=1e-4 caused this NaN issue.
I changed the learning rate from 1e-4 to 3e-5 and the problem was solved.
Hope this helps.
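
For example, keeping the flag style used earlier in this thread (illustrative values only; keep whatever batch size you are already using):

TRAIN_FLAGS="--lr 3e-5 --batch_size 128"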

@ONobody

ONobody commented Mar 1, 2023

@forever208 Hello, may I get your contact information to ask a few questions? Thank you.

@forever208

forever208 commented Mar 1, 2023

Hi @ONobody, of course, my email: [email protected]

@hxy-123-coder

Hello! I want to ask what value the loss of your trained model converges to. I trained on my own dataset, but the generated images are all noise; I can't make out any content in them at all.

@forever208

forever208 commented Jun 13, 2023

Hello! I want to ask what value the loss of your trained model converges to. I trained on my own dataset, but the generated images are all noise; I can't make out any content in them at all.

About 0.055. ImageNet is the most time-consuming dataset to train on; I suggest you first try CIFAR-10 or the LSUN datasets.

Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve FID and training speed on top of guided-diffusion; feel free to take a look.

@hxy-123-coder

Thanks a lot.
