
How to train on multi GPU? #93

Open
kjRainy opened this issue Apr 21, 2023 · 5 comments

Comments


kjRainy commented Apr 21, 2023

When I use --multi_gpu 0,1,2, I get an error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

How should I change the code to fix this?
Thanks!

Collaborator

WuJunde commented Apr 21, 2023

Can you tell me which line the error is reported on?

Author

kjRainy commented Apr 21, 2023

When I run:
python scripts/segmentation_train.py --data_name PROMISE12 --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2

It fails with the following traceback:
Traceback (most recent call last):
File "scripts/segmentation_train.py", line 113, in
main()
File "scripts/segmentation_train.py", line 82, in main
lr_anneal_steps=args.lr_anneal_steps,
File "./guided_diffusion/train_util.py", line 186, in run_loop
self.run_step(batch, cond)
File "./guided_diffusion/train_util.py", line 207, in run_step
sample = self.forward_backward(batch, cond)
File "./guided_diffusion/train_util.py", line 238, in forward_backward
losses1 = compute_losses()
File "./guided_diffusion/gaussian_diffusion.py", line 1007, in training_losses_segmentation
clip_denoised=False,
File "./guided_diffusion/gaussian_diffusion.py", line 941, in _vb_terms_bpd
model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
File "./guided_diffusion/respace.py", line 90, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "./guided_diffusion/gaussian_diffusion.py", line 287, in p_mean_variance
model_log_variance = frac * max_log + (1 - frac) * min_log
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
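
The line the traceback points at (model_log_variance = frac * max_log + (1 - frac) * min_log) combines the model's output with the precomputed noise-schedule log-variances, so my guess is that those schedule tensors end up on a different GPU than the model output. Is the fix something like moving them onto the model output's device before combining? A rough sketch of what I mean (the helper name is made up, this is not the actual code in gaussian_diffusion.py):

import torch

def interpolate_log_variance(frac: torch.Tensor,
                             min_log: torch.Tensor,
                             max_log: torch.Tensor) -> torch.Tensor:
    # frac comes from the model output, so it defines the target device;
    # the schedule constants may still live on cuda:0 while this shard runs on cuda:1/2.
    min_log = min_log.to(frac.device)
    max_log = max_log.to(frac.device)
    return frac * max_log + (1 - frac) * min_log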

@xupinggl

I get an error when running on multiple GPUs; looking for a solution.

python scripts/segmentation_train.py --data_name NC2016 --data_dir "/PublicFile/xp_data/NC2016/" --out_dir "./results/NC2016/trainv1" --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2

training...
Traceback (most recent call last):
File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 117, in
main()
File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 69, in main
TrainLoop(
File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 186, in run_loop
self.run_step(batch, cond)
File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 207, in run_step
sample = self.forward_backward(batch, cond)
File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 238, in forward_backward
losses1 = compute_losses()
File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/gaussian_diffusion.py", line 1003, in training_losses_segmentation
model_output, cal = model(x_t, self._scale_timesteps(t), **model_kwargs)
File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0]) # type: ignore[index]
File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

Collaborator

WuJunde commented May 25, 2023

Some part of your module is on a different GPU. Did you get the same error when running on the example dataset? If the example cases run without problems, then the issue is in your data loading process.
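
For the DataParallel-style error above ("module must have its parameters and buffers on device cuda:0 (device_ids[0])"), the contract is that the whole model is moved to device_ids[0] before wrapping, and that the input batch starts on that same device. A minimal sketch of that setup (illustrative only; the helper name is made up and this is not the exact code in train_util.py):

import torch
import torch.nn as nn

def wrap_for_multi_gpu(model: nn.Module, gpu_ids=(0, 1, 2)) -> nn.Module:
    primary = torch.device(f"cuda:{gpu_ids[0]}")
    model = model.to(primary)              # every parameter/buffer on cuda:0 first
    return nn.DataParallel(model, device_ids=list(gpu_ids))

# The batch (and any conditioning tensors) produced by the data loader should
# also be moved to the primary device before the forward pass:
#   batch = batch.to(primary)
#   cond = {k: v.to(primary) for k, v in cond.items()}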


DBook111 commented Jul 4, 2023

> I get an error when running on multiple GPUs; looking for a solution.

python scripts/segmentation_train.py --data_name NC2016 --data_dir "/PublicFile/xp_data/NC2016/" --out_dir "./results/NC2016/trainv1" --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2

training...
Traceback (most recent call last):
  [same traceback as in xupinggl's comment above]
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

I ran into exactly the same error. Have you solved it?
