
Bug: there is an error when training with fp16 #19

Open
Aaron2117 opened this issue Apr 16, 2024 · 9 comments

Comments

@Aaron2117

No description provided.

@i-amgeek

At least you should have shared the error here. There's no point in raising issues like this; it feels like spam.

@Aaron2117
Author

File "/data3/whr/AIGC/OOTDiffusion-train-main/net/v5_basev4_shape512_fixseed_dataauc_newagnostc/run/ootd_train.py", line 595, in
optimizer.step()
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
self.scaler.step(self.optimizer, closure)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
self.unscale_(optimizer)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
optimizer_state["found_inf_per_device"] = self.unscale_grads(optimizer, inv_scale, found_inf, False)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in unscale_grads
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

But it works fine when I train with fp32.
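
For context, GradScaler raises this exact ValueError whenever the gradients it is asked to unscale are already fp16, which usually means the trainable parameters themselves were cast to half precision; accelerate's mixed_precision: fp16 expects fp32 master weights and only autocasts the forward pass. A minimal standalone sketch that reproduces the error (assuming a CUDA device is available):

import torch

model = torch.nn.Linear(4, 4).cuda().half()  # trainable params in fp16 -> the problem
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
scaler.update()

So the question is where ootd_train.py ends up with fp16 parameters before optimizer.step() is reached.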

@Aaron2117
Author

@i-amgeek @lyc0929

@zhangquanwei962
Collaborator

@i-amgeek @lyc0929
Hi, I can give you some suggestions.

  1. First, we didn't manually set the floating-point precision.
  2. You can try running accelerate config to update your config, enabling DeepSpeed and using FP16.
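
(Roughly, enabling DeepSpeed through accelerate config replaces the distributed_type and adds a deepspeed_config section to the YAML; the values below are illustrative, not tested against this repo:)

distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
mixed_precision: fp16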

@Aaron2117
Author

@i-amgeek @lyc0929
Hi, I can give you some suggestions.

  1. First, we didn't manually set the floating-point precision.
  2. You can try running accelerate config to update your config, enabling DeepSpeed and using FP16.

I didn't manually set the floating-point precision; I set fp16 via accelerate config. This is my config file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '3'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The training code works when I set mixed_precision to no.
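
A common fix pattern in accelerate/diffusers-style training scripts is to keep the trainable module in fp32 and move only the frozen modules to fp16, so GradScaler still sees fp32 gradients. A minimal sketch; the module names are hypothetical stand-ins for whatever ootd_train.py actually builds:

import torch

device = "cuda"
weight_dtype = torch.float16  # matches mixed_precision: fp16

# Stand-ins for the real modules (names are hypothetical, not from the repo).
frozen_encoder = torch.nn.Linear(8, 8)  # e.g. a frozen VAE / text encoder
trainable_net = torch.nn.Linear(8, 8)   # e.g. the UNet being trained

# Frozen parts can live in fp16 to save memory...
frozen_encoder.to(device, dtype=weight_dtype)
frozen_encoder.requires_grad_(False)

# ...but the trainable module must keep fp32 master weights, otherwise
# GradScaler raises "Attempting to unscale FP16 gradients."
trainable_net.to(device, dtype=torch.float32)

optimizer = torch.optim.AdamW(trainable_net.parameters(), lr=1e-5)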

@joe-zxh

joe-zxh commented Apr 17, 2024

Encountering a similar problem too.

@rohitpaul23

same here

@paluchnuggets

I have the same error, has anyone solved it?

@coolistener

same here
