
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #84

Open
DWBSIC opened this issue Apr 13, 2023 · 6 comments

Comments


DWBSIC commented Apr 13, 2023

Traceback (most recent call last):
  File "segmentation_train.py", line 117, in <module>
    main()
  File "segmentation_train.py", line 69, in main
    TrainLoop(
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 83, in __init__
    self._load_and_sync_parameters()
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 139, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "E:\MedSegDiff-master2.0\guided_diffusion\dist_util.py", line 78, in sync_params
    dist.broadcast(p, 0)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1438, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1561, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Hello, training works fine for me on Linux, but when I try to train on Windows, with all parameters set exactly as in the author's README, I get the error above. Why does this happen? Any help would be appreciated.

if you want to chat with me: WeChat: DWBSIC
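
For reference, the message itself means that an in-place write hit a leaf tensor with requires_grad=True (dist.broadcast copies data into p in place). A minimal standalone repro of the same RuntimeError, independent of this repo, would look something like this:

import torch

p = torch.zeros(3, requires_grad=True)  # a leaf tensor that requires grad
p.add_(1.0)  # in-place op on it raises: RuntimeError: a leaf Variable that
             # requires grad is being used in an in-place operation.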

@170744039

I ran into this problem too. Since I'm on Windows, I use dist.init_process_group(backend="gloo", init_method="env://"): the backend in the initialization needs to be changed first, otherwise you get an NCCL error. After that I hit the same problem as you, but it runs fine once you add p = p + 0 before dist.broadcast. I'm using my own training data, but my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?
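
To make the backend change concrete, here is a minimal sketch of the initialization tweak described above (the exact call site in the repo, e.g. inside dist_util.py, is an assumption):

import torch.distributed as dist

# NCCL is not available on Windows, so fall back to the gloo backend there
dist.init_process_group(backend="gloo", init_method="env://")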


xupinggl commented May 7, 2023

It has been solved according to your method. Thank you very much! @170744039

@blackcat1121

@170744039 Thank you for the solution! My knowledge of DDP is so limited that I had no clue how to fix this, even though I knew it was probably a Windows/Linux problem. Could you share more about how p = p + 0 solves "a leaf Variable that requires grad is being used in an in-place operation"?

WuJunde (Collaborator) commented May 25, 2023

I ran into this problem too. Since I'm on Windows, I use dist.init_process_group(backend="gloo", init_method="env://"): the backend in the initialization needs to be changed first, otherwise you get an NCCL error. After that I hit the same problem as you, but it runs fine once you add p = p + 0 before dist.broadcast. I'm using my own training data, but my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?

@170744039 Hi! Thank you for your awesome Windows solution. Could you open a pull request with your modification? I do not have a Windows PC, so it would be great if someone could help with Windows compatibility.


@Longchentong

dist_util.py

old

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)

new

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        # p + 0 makes a fresh non-leaf copy, so the in-place broadcast
        # below no longer writes into a leaf tensor that requires grad
        p = p + 0
        with th.no_grad():
            dist.broadcast(p, 0)
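
A note on why this appears to work: p + 0 allocates a new tensor that is not a leaf, so the in-place write performed by dist.broadcast no longer targets a leaf that requires grad. An alternative sketch (an assumption, not part of this repo) is to broadcast a detached view, which shares storage with the parameter and therefore still updates it in place:

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # p.detach() shares storage with p but has requires_grad=False,
            # so the in-place broadcast is allowed and p itself is updated
            dist.broadcast(p.detach(), 0)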
