
Bug: there is an error when training with fp16 #19

Open
Aaron2117 opened this issue Apr 16, 2024 · 9 comments

Comments

@Aaron2117

No description provided.

@i-amgeek

At least you should have shared the error here. There's no point in raising issues like this; it feels like spam.

@Aaron2117
Author

File "/data3/whr/AIGC/OOTDiffusion-train-main/net/v5_basev4_shape512_fixseed_dataauc_newagnostc/run/ootd_train.py", line 595, in
optimizer.step()
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
self.scaler.step(self.optimizer, closure)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
self.unscale_(optimizer)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
optimizer_state["found_inf_per_device"] = self.unscale_grads(optimizer, inv_scale, found_inf, False)
File "/data1/anaconda/envs/ootd/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in unscale_grads
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

But it works fine when I train with fp32.
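
For context, GradScaler raises this exact ValueError whenever the gradients it is asked to unscale are already fp16, which usually means the trainable parameters themselves were cast to half precision; accelerate's mixed_precision: fp16 expects fp32 master weights and only autocasts the forward pass. A minimal standalone sketch that reproduces the error (assuming a CUDA device is available):

import torch

model = torch.nn.Linear(4, 4).cuda().half()  # trainable params in fp16 -> the problem
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
scaler.update()

So the question is where ootd_train.py ends up with fp16 parameters before optimizer.step() is reached.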

@Aaron2117
Author

@i-amgeek @lyc0929

@zhangquanwei962
Collaborator

@i-amgeek @lyc0929
Hi, I can give you some suggestions.

  1. First, we didn't manually set the floating-point precision.
  2. You can try running accelerate config to update your config, enabling DeepSpeed and using FP16.
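
(Roughly, enabling DeepSpeed through accelerate config replaces the distributed_type and adds a deepspeed_config section to the YAML; the values below are illustrative, not tested against this repo:)

distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
mixed_precision: fp16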

@Aaron2117
Author

@i-amgeek @lyc0929
Hi, I can give you some suggestions.

  1. First, we didn't manually set the floating-point precision.
  2. You can try running accelerate config to update your config, enabling DeepSpeed and using FP16.

I didn't manually set the floating-point precision; I set fp16 via accelerate config. This is my config file:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: '3'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The training code works when I set mixed_precision to no.
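
A common fix pattern in accelerate/diffusers-style training scripts is to keep the trainable module in fp32 and move only the frozen modules to fp16, so GradScaler still sees fp32 gradients. A minimal sketch; the module names are hypothetical stand-ins for whatever ootd_train.py actually builds:

import torch

device = "cuda"
weight_dtype = torch.float16  # matches mixed_precision: fp16

# Stand-ins for the real modules (names are hypothetical, not from the repo).
frozen_encoder = torch.nn.Linear(8, 8)  # e.g. a frozen VAE / text encoder
trainable_net = torch.nn.Linear(8, 8)   # e.g. the UNet being trained

# Frozen parts can live in fp16 to save memory...
frozen_encoder.to(device, dtype=weight_dtype)
frozen_encoder.requires_grad_(False)

# ...but the trainable module must keep fp32 master weights, otherwise
# GradScaler raises "Attempting to unscale FP16 gradients."
trainable_net.to(device, dtype=torch.float32)

optimizer = torch.optim.AdamW(trainable_net.parameters(), lr=1e-5)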

@joe-zxh

joe-zxh commented Apr 17, 2024

Encountering a similar problem too.

@rohitpaul23

same here

@paluchnuggets

I have the same error, has anyone solved it?

@coolistener

same here
