Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 268, in sft_main
    return SwiftSft(args).main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/base.py", line 46, in main
    result = self.run()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 141, in run
    return self.train(trainer)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 193, in train
    trainer.train(trainer.args.resume_from_checkpoint)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/trainers/mixin.py", line 261, in train
    res = super().train(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 3747, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/accelerator.py", line 2233, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/autograd/__init__.py", line 256, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SumBackward0 returned an invalid gradient at index 0 - got [0] but expected shape compatible with [12480]
Your hardware and system info
npu: Ascend 910B4
pytorch/pytorch-npu==2.1.0
transformers==4.46.2 (4.49 does not work either)
accelerate==1.1.1
deepspeed==0.14. (0.15.* does not work either)
OS: pytorch_2.1.0-cann_8.0.rc2-py_3.9-euler_2.10.7:0.0.1
Describe the bug
On a 910B NPU, I am reproducing the SLAM-omni speech model on top of Qwen7B, which requires launching training with DeepSpeed ZeRO-3.
1. Training launched with ZeRO-3 crashes with the RuntimeError shown in the traceback above.
2. Under ZeRO-3, training runs normally if the newly added network module (group_decode_adapter) is frozen; leaving it trainable triggers the error (a minimal sketch of the freezing is given after this list).
3. Training with ZeRO-2 works, but device-memory limits then only allow a 3B model, so ZeRO-3 is needed.
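For point 2, a minimal sketch of the freezing, assuming the adapter is exposed as a `group_decode_adapter` attribute on the model (the attribute name comes from this issue; the helper itself is illustrative):

```python
import torch.nn as nn

def freeze_group_decode_adapter(model: nn.Module) -> None:
    """Freeze the custom adapter; with it frozen, ZeRO-3 training runs (point 2).

    `group_decode_adapter` is the attribute name used in this issue; the
    surrounding model structure is an assumption.
    """
    for param in model.group_decode_adapter.parameters():
        param.requires_grad_(False)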
Error message: see the traceback at the top of this report.
The ZeRO-3 launch script that triggers the error:
The new network module added on top of Qwen:
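(This is not the actual code from the issue. As a rough sketch under assumptions, such a group-decode adapter might be a single linear projection that emits one group of speech-token logits per position; all sizes and names other than group_decode_adapter are hypothetical:)

```python
import torch
import torch.nn as nn

class GroupDecodeAdapter(nn.Module):
    """Hypothetical sketch: project each LLM hidden state to `group_size`
    speech-token logits in one step. Sizes and structure are assumptions."""

    def __init__(self, hidden_size: int, codec_vocab_size: int, group_size: int):
        super().__init__()
        self.group_size = group_size
        self.codec_vocab_size = codec_vocab_size
        # One linear layer producing logits for a whole group of speech tokens.
        self.proj = nn.Linear(hidden_size, group_size * codec_vocab_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) -> (batch, seq, group_size, codec_vocab)
        logits = self.proj(hidden_states)
        return logits.view(*hidden_states.shape[:-1],
                           self.group_size, self.codec_vocab_size)
```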
Use of the new module in forward:
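(Again purely illustrative, not the issue author's code: one plausible wiring is to run the backbone with output_hidden_states=True and feed the last hidden layer into the adapter:)

```python
import torch.nn as nn

class QwenWithGroupDecode(nn.Module):
    """Hypothetical wiring: run the Qwen backbone, then feed its hidden states
    through the adapter to get speech-token logits alongside the text logits."""

    def __init__(self, qwen_model: nn.Module, adapter: nn.Module):
        super().__init__()
        self.llm = qwen_model
        self.group_decode_adapter = adapter

    def forward(self, input_ids, attention_mask=None):
        out = self.llm(input_ids=input_ids, attention_mask=attention_mask,
                       output_hidden_states=True)
        # Speech-token logits from the last hidden layer; text logits unchanged.
        speech_logits = self.group_decode_adapter(out.hidden_states[-1])
        return out.logits, speech_logits
```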
Additional context
I also tried excluding the custom module (group_decode_adapter) from parameter partitioning, but the error still occurs (see the sketch below). The loss values printed during training all look normal, and requires_grad is as expected everywhere (neither the LLM nor group_decode_adapter is frozen).
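It is not stated which mechanism was used to keep the module unpartitioned; one common approach is DeepSpeed's leaf-module API, sketched here under that assumption (GroupDecodeAdapter refers to the hypothetical class sketched above):

```python
from deepspeed.utils import set_z3_leaf_modules

def mark_adapter_as_leaf(model) -> None:
    """Mark the adapter class as a ZeRO-3 "leaf" so its parameters are
    gathered as one unit before forward, instead of being hooked per-submodule.

    Hedged sketch: it is not confirmed this is the exact mechanism the issue
    author used; GroupDecodeAdapter is the hypothetical class from above.
    """
    set_z3_leaf_modules(model, [GroupDecodeAdapter])
```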