
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #84

Open
DWBSIC opened this issue Apr 13, 2023 · 6 comments

Comments


DWBSIC commented Apr 13, 2023

Traceback (most recent call last):
  File "segmentation_train.py", line 117, in <module>
    main()
  File "segmentation_train.py", line 69, in main
    TrainLoop(
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 83, in __init__
    self._load_and_sync_parameters()
  File "E:\MedSegDiff-master2.0\guided_diffusion\train_util.py", line 139, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "E:\MedSegDiff-master2.0\guided_diffusion\dist_util.py", line 78, in sync_params
    dist.broadcast(p, 0)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1438, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda3\envs\py38\lib\site-packages\torch\distributed\distributed_c10d.py", line 1561, in broadcast
    work.wait()
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

Hello, training works fine for me on Linux, but when I try to train on Windows, with all parameters set exactly as in the author's README, I get the error above. Why does this happen? Any help would be appreciated.

if you want to chat with me: WeChat: DWBSIC
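
For reference, the message itself means that an in-place write hit a leaf tensor with requires_grad=True (dist.broadcast copies data into p in place). A minimal standalone repro of the same RuntimeError, independent of this repo, would look something like this:

import torch

p = torch.zeros(3, requires_grad=True)  # a leaf tensor that requires grad
p.add_(1.0)  # in-place op on it raises: RuntimeError: a leaf Variable that
             # requires grad is being used in an in-place operation.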

@170744039

I ran into this problem too. Since I'm on Windows, I use dist.init_process_group(backend="gloo", init_method="env://"): the backend in the initialization needs to be changed first, otherwise you get an NCCL error. After that I hit the same problem as you, but it runs fine once you add p = p + 0 before dist.broadcast. I'm using my own training data, but my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?
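
To make the backend change concrete, here is a minimal sketch of the initialization tweak described above (the exact call site in the repo, e.g. inside dist_util.py, is an assumption):

import torch.distributed as dist

# NCCL is not available on Windows, so fall back to the gloo backend there
dist.init_process_group(backend="gloo", init_method="env://")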


xupinggl commented May 7, 2023

It has been solved according to your method. Thank you very much! @170744039

@blackcat1121

@170744039 Thank you for the solution! My knowledge of DDP is so limited that I had no clue how to fix this, even though I knew it was probably a Windows/Linux problem. Could you share more about how p = p + 0 solves "a leaf Variable that requires grad is being used in an in-place operation"?

WuJunde (Collaborator) commented May 25, 2023

I ran into this problem too. Since I'm on Windows, I use dist.init_process_group(backend="gloo", init_method="env://"): the backend in the initialization needs to be changed first, otherwise you get an NCCL error. After that I hit the same problem as you, but it runs fine once you add p = p + 0 before dist.broadcast. I'm using my own training data, but my remaining problem is that GPU utilization is very low, only around 60-70%. Have you solved this completely?

@170744039 Hi! Thank you for your awesome Windows solution. Could you open a pull request with your modification? I do not have a Windows PC, so it would be great if someone could help with Windows compatibility.


@Longchentong

dist_util.py

old

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            dist.broadcast(p, 0)

new

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        # p + 0 makes a fresh non-leaf copy, so the in-place broadcast
        # below no longer writes into a leaf tensor that requires grad
        p = p + 0
        with th.no_grad():
            dist.broadcast(p, 0)
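
A note on why this appears to work: p + 0 allocates a new tensor that is not a leaf, so the in-place write performed by dist.broadcast no longer targets a leaf that requires grad. An alternative sketch (an assumption, not part of this repo) is to broadcast a detached view, which shares storage with the parameter and therefore still updates it in place:

def sync_params(params):
    """
    Synchronize a sequence of Tensors across ranks from rank 0.
    """
    for p in params:
        with th.no_grad():
            # p.detach() shares storage with p but has requires_grad=False,
            # so the in-place broadcast is allowed and p itself is updated
            dist.broadcast(p.detach(), 0)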
