nan occurs when training ImageNet 128x128 #50
Comments
Hi Shoufa,
Could you please send the exact command you are running for training?
This is indeed a NaN during the forward pass (hence losses are NaN), which
looks like a divergence.
…On Tue, Jul 12, 2022 at 12:34 AM Shoufa Chen ***@***.***> wrote:
Hi, @unixpickle <https://github.com/unixpickle>
Thanks for your awesome work and open source.
I met the nan issue when training on ImageNet 128x128,
-----------------------------
| lg_loss_scale | -1.62e+04 |
| loss | nan |
| loss_q0 | nan |
| loss_q1 | nan |
| loss_q2 | nan |
| loss_q3 | nan |
| mse | nan |
| mse_q0 | nan |
| mse_q1 | nan |
| mse_q2 | nan |
| mse_q3 | nan |
| samples | 3.92e+07 |
| step | 1.53e+05 |
| vb | nan |
| vb_q0 | nan |
| vb_q1 | nan |
| vb_q2 | nan |
| vb_q3 | nan |
-----------------------------
Found NaN, decreased lg_loss_scale to -16199.354
Found NaN, decreased lg_loss_scale to -16200.354
Found NaN, decreased lg_loss_scale to -16201.354
Found NaN, decreased lg_loss_scale to -16202.354
Found NaN, decreased lg_loss_scale to -16203.354
I used fp16. Did you meet similar issues?
Thanks in advance.
Hi, @unixpickle Thanks for your help. My command:
I use 4 nodes, each of which has 8 GPUs.
Do you have a record of the loss before the NaN occurred? Did it spike right before NaNs started happening? Your command itself looks good to me, so I don't think it's a simple hyperparameter issue. Also, have you tried looking at samples from before the divergence, as a sanity check that the model is actually learning correctly?
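If it helps to check for such a spike, here is a minimal sketch; it assumes the run's logger wrote a progress.csv file with "step" and "loss" columns (as in the tables in this thread), and the path is only a placeholder:

```python
import pandas as pd

# Load the training log and inspect the last few rows before the first NaN loss.
df = pd.read_csv("path/to/logdir/progress.csv")  # placeholder path
nan_rows = df.index[df["loss"].isna()]
if len(nan_rows):
    first = nan_rows[0]
    # A sudden jump in "loss" just before this point would suggest a divergence.
    print(df.loc[max(0, first - 10):first, ["step", "loss"]])
else:
    print("no NaN losses logged")
```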
Perhaps this bug is related to the issue here: #44. If so, perhaps we could try patching that bug and see if the error resolves itself. The patch would involve changing this line

self.master_params[0].grad.mul_(1.0 / (2 ** self.lg_loss_scale))

to something like this:

for p in self.master_params:
    p.grad.mul_(1.0 / (2 ** self.lg_loss_scale))
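In other words, every entry of self.master_params should have its gradient unscaled, not only the first one. A minimal sketch of what the patched unscaling could look like (the method name and surrounding structure are illustrative, not the exact fp16_util.py code):

```python
def _unscale_master_grads(self):
    # Divide every master gradient by the loss scale, not just master_params[0],
    # so the optimizer never sees gradients that are 2 ** lg_loss_scale too large.
    inv_scale = 1.0 / (2 ** self.lg_loss_scale)
    for p in self.master_params:
        if p.grad is not None:
            p.grad.mul_(inv_scale)
```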
Thanks for your help. I will patch this bug and try again. I will post my results in about 2 days.
The problem of NaNs still exists with this change.
I am now at 430180 steps and have not run into NaNs.
That's so strange. I'm training a 256*256 model with batch size 256 and learning rate 1e-4 on 8 nodes.
I am training a 128*128 ImageNet model.
The problem of NaNs still exists with this change.
Me too, I still get NaN loss and the training fails when training on ImageNet 64*64.
@ShoufaChen, hi, have you completely solved this issue? Do you still get any NaN loss or training failures?
Hello! I also had this problem. Did you solve it? In my case, the program still runs and the loss does not seem broken yet, but it keeps printing "Found NaN".
----------------------------
| lg_loss_scale | -909 |
| loss | 0.115 |
| loss_q0 | 0.261 |
| loss_q1 | 0.0599 |
| loss_q2 | 0.0339 |
| loss_q3 | 0.0241 |
| mse | 0.111 |
| mse_q0 | 0.25 |
| mse_q1 | 0.0594 |
| mse_q2 | 0.0336 |
| mse_q3 | 0.0237 |
| samples | 1.98e+03 |
| step | 990 |
| vb | 0.00385 |
| vb_q0 | 0.0104 |
| vb_q1 | 0.00048 |
| vb_q2 | 0.00031 |
| vb_q3 | 0.000323 |
----------------------------
Found NaN, decreased lg_loss_scale to -915.944
Found NaN, decreased lg_loss_scale to -916.944
Found NaN, decreased lg_loss_scale to -917.944
Found NaN, decreased lg_loss_scale to -918.944
Found NaN, decreased lg_loss_scale to -919.944
Found NaN, decreased lg_loss_scale to -920.944
Found NaN, decreased lg_loss_scale to -921.944
Found NaN, decreased lg_loss_scale to -922.944
Found NaN, decreased lg_loss_scale to -923.944
Looking forward to your reply.
@JawnHoan My solution was to re-clone the whole repo and implement my own method... I know it is not a good idea, but it works for me.
Hi! Thanks for your suggestion. As you said, I saw that the author has updated fp16_util.py. I updated my local copy, but the problem still exists.
I think it is normal to find NaNs during mixed-precision training, and decreasing lg_loss_scale is exactly how the trainer handles them. However, if the program keeps finding NaNs, that means decreasing lg_loss_scale is not able to fix the problem.
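For reference, a minimal sketch of the dynamic loss-scaling idea described above (illustrative only, not the repo's exact fp16_util.py code): the loss is scaled up before backward so small fp16 gradients do not underflow; if any gradient comes back non-finite, the step is skipped and lg_loss_scale is lowered, which is what the "Found NaN, decreased lg_loss_scale" message reports; otherwise the gradients are unscaled, the optimizer steps, and the scale grows slowly.

```python
import torch

def dynamic_loss_scale_step(loss, params, opt, lg_loss_scale, growth=1e-3):
    # Scale the loss so small fp16 gradients do not underflow to zero.
    (loss * (2 ** lg_loss_scale)).backward()

    grads = [p.grad for p in params if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        # Overflow: skip this step and lower the scale.
        opt.zero_grad()
        return lg_loss_scale - 1

    # Unscale every gradient before stepping, then grow the scale slowly.
    for g in grads:
        g.mul_(1.0 / (2 ** lg_loss_scale))
    opt.step()
    opt.zero_grad()
    return lg_loss_scale + growth
```

If the scale keeps falling step after step (as in the -1.6e+04 log above), the NaNs are coming from the forward pass itself, and lowering the loss scale further cannot recover the run.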
Is it normal that NaN occurs at intervals? When NaN appears, "decrease lg_loss_scale" kicks in; after a period of normal training, NaN appears again and lg_loss_scale is decreased again. Is this normal, or should I interrupt the run?
----------------------------
| grad_norm | 0.144 |
| lg_loss_scale | 23.3 |
| loss | 0.185 |
| loss_q0 | 0.285 |
| loss_q1 | 0.0296 |
| loss_q2 | 0.0139 |
| loss_q3 | 0.44 |
| mse | 0.0367 |
| mse_q0 | 0.147 |
| mse_q1 | 0.029 |
| mse_q2 | 0.0136 |
| mse_q3 | 0.00291 |
| param_norm | 303 |
| samples | 2.62e+04 |
| step | 3.27e+03 |
| vb | 0.148 |
| vb_q0 | 0.138 |
| vb_q1 | 0.000615 |
| vb_q2 | 0.000278 |
| vb_q3 | 0.437 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.278000000004006
...
....(normal)
....(normal)
... (normal)
...
----------------------------
| grad_norm | 0.13 |
| lg_loss_scale | 23.6 |
| loss | 0.0725 |
| loss_q0 | 0.205 |
| loss_q1 | 0.0294 |
| loss_q2 | 0.0108 |
| loss_q3 | 0.00471 |
| mse | 0.0481 |
| mse_q0 | 0.127 |
| mse_q1 | 0.0288 |
| mse_q2 | 0.0105 |
| mse_q3 | 0.00452 |
| param_norm | 307 |
| samples | 3.71e+04 |
| step | 4.64e+03 |
| vb | 0.0245 |
| vb_q0 | 0.0776 |
| vb_q1 | 0.00059 |
| vb_q2 | 0.00021 |
| vb_q3 | 0.000184 |
----------------------------
Found NaN, decreased lg_loss_scale to 22.641000000005672
...
...
Hi, how do you achieve multi-node, multi-GPU training? Did you change the code? I tried multi-node, multi-GPU training with another program but failed because of slow communication between the nodes. Did you run into this? Can you share some experience with multi-node, multi-GPU training?
@fido20160817 it is normal, no worries about it.
Thanks! 🤝
@JawnHoan hi, if you still have this issue, I suggest you decrease the learning rate. In my experiments with batch=128 on ImageNet64, lr=1e-4 causes this NaN issue.
@forever208 Hello, may I get your contact information to ask some questions? Thank you.
Hi @ONobody, of course, my email: [email protected]
Hello! I want to ask what loss the model you trained converges to. I trained on my own dataset, but the generated images are all noise; I cannot see any content in them at all.
About 0.055. ImageNet is the most time-consuming dataset to train; I suggest you first try CIFAR-10 or LSUN. Self-promotion: our ICML 2023 paper DDPM-IP shows an extremely easy way to dramatically improve FID and training speed based on guided-diffusion, feel free to take a look.
Thanks a lot.