
NPU ZeRO-3 training of a custom model fails with "Function SumBackward0 returned an invalid gradient at index 0" #3262

Zane-Qbb opened this issue Feb 25, 2025 · 0 comments
Describe the bug
On a 910B, I am reproducing the SLAM-omni speech model on top of Qwen7B, which requires launching training with DeepSpeed ZeRO-3.

1. Launching training with ZeRO-3 fails with the error below.
2. With ZeRO-3, training runs normally if the newly added module (group_decode_adapter) is frozen (see the sketch after this list); leaving it unfrozen triggers the error.
3. Launching with ZeRO-2 works, but device memory limits it to the 3B model, so ZeRO-3 is needed.
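
A minimal sketch of the freezing referred to in item 2 (the attribute path model.group_decode_adapter is an assumption; adapt it to wherever the adapter hangs on the model):

# Freeze the adapter so ZeRO-3 never has to produce gradients for it.
for p in model.group_decode_adapter.parameters():
    p.requires_grad = False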

Error message:

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 268, in sft_main
    return SwiftSft(args).main()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/base.py", line 46, in main
    result = self.run()
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 141, in run
    return self.train(trainer)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/llm/train/sft.py", line 193, in train
    trainer.train(trainer.args.resume_from_checkpoint)
  File "/home/ma-user/modelarts/user-job-dir/ms-swift-master/swift/trainers/mixin.py", line 261, in train
    res = super().train(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/trainer.py", line 3747, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/accelerator.py", line 2233, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/accelerate/utils/deepspeed.py", line 186, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2020, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 2259, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/torch/autograd/__init__.py", line 256, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SumBackward0 returned an invalid gradient at index 0 - got [0] but expected shape compatible with [12480]
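
The [0]-vs-[12480] pattern is characteristic of ZeRO-3: while a parameter is partitioned away, its local tensor has shape [0], and the full shape is kept in DeepSpeed's ds_shape attribute. A hedged debugging sketch (attribute names are ZeRO-3 internals; run it after DeepSpeed has wrapped the model) to check which parameters are actually partitioned:

# List local vs. full shapes of every ZeRO-3-managed parameter.
for name, p in model.named_parameters():
    if hasattr(p, "ds_shape"):
        print(name, tuple(p.shape), "->", tuple(p.ds_shape), p.ds_status)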

The ZeRO-3 launch script that triggers the error:

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift sft \
--model_type=slam_omni \
--model=/Qwen2.5-3B-for-omni \
--dataset=xxxx.json \
--num_train_epochs=26 \
--train_type=full \
--output_dir=xxxx \
--eval_steps=200000000000 \
--save_steps=2000 \
--device=npu \
--ddp_backend hccl \
--per_device_train_batch_size=1 \
--dataloader_num_workers=5 \
--lazy_tokenize true \
--torch_dtype=float32 \
--check_model=false \
--max_length=3072 \
--learning_rate=1e-4 \
--warmup_steps=1000 \
--lr_scheduler_type=cosine \
--deepspeed=zero3  >log.txt 2>&1 &
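
For reference, --deepspeed=zero3 expands to one of ms-swift's builtin DeepSpeed configs; the flag also accepts a path to a JSON file, so a hand-rolled config can be substituted for experiments. A minimal sketch (my own dict, not ms-swift's exact builtin file):

import json

# Minimal ZeRO-3 config; "auto" values are filled in by the HF Trainer integration.
zero3_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
with open("zero3.json", "w") as f:
    json.dump(zero3_config, f, indent=2)  # then launch with --deepspeed=zero3.json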

The new network module added on top of Qwen:

import torch.nn as nn
from deepspeed.zero import Init  # assumed import; matches the Init(...) signature used below


class Linear_GroupDecodeAdapter(nn.Module):
    def __init__(self, audio_vocab_size, code_layer):
        super().__init__()
        self.audio_vocab_size = audio_vocab_size
        self.code_layer = code_layer
        with Init(data_parallel_group=None, enabled=False):  # keep this layer out of ZeRO-3 partitioning
            self.linear = nn.Linear(audio_vocab_size, code_layer * audio_vocab_size)

    def forward(self, logits):
        return self.linear(logits)

How the new module is used in forward:

...
# Adapter expands the audio slice of the logits:
# [B, T, audio_vocab_size] -> [B, T, code_layer * audio_vocab_size]
logits_a = self.group_decode_adapter(logits[..., self.model_config.vocab_config.padded_text_vocabsize:])
# Split the fused output back into code_layer (= 3) chunks of [B, T, padded_audio_vocabsize]
logits_a = torch.split(
    logits_a,
    [
        self.model_config.vocab_config.padded_audio_vocabsize,
        self.model_config.vocab_config.padded_audio_vocabsize,
        self.model_config.vocab_config.padded_audio_vocabsize
    ],
    dim=2
)
# Stack the chunks along a codebook axis: [B, code_layer, T, padded_audio_vocabsize]
logits_a = tuple(tensor.unsqueeze(1) for tensor in logits_a)
logits_a = torch.cat(logits_a, dim=1)
loss_a = slam_loss(logits=logits_a, labels=labels_a) * 3
...
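
For what it's worth, the split/unsqueeze/cat sequence above is equivalent to a single reshape; a sketch assuming B and T are the batch and sequence dims (adapter_out is a hypothetical name for the adapter's output):

# [B, T, 3 * V] -> [B, 3, T, V], the same layout the cat above produces
V = self.model_config.vocab_config.padded_audio_vocabsize
B, T, _ = adapter_out.shape
logits_a = adapter_out.view(B, T, 3, V).permute(0, 2, 1, 3).contiguous()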

Your hardware and system info
npu: Ascend 910B4
pytorch/pytorch-npu==2.1.0
transformers==4.46.2 (4.49.* also fails)
accelerate==1.1.1
deepspeed==0.14.* (0.15.* also fails)
OS image: pytorch_2.1.0-cann_8.0.rc2-py_3.9-euler_2.10.7:0.0.1

Additional context
I tried forcing the custom module (group_decode_adapter) to skip ZeRO-3 partitioning, but the error persists. During training the printed loss values are all normal, and the requires_grad flags all match expectations (llm, group_decode_adapter, etc. are all unfrozen).
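
One avenue that may be worth trying (a hedged sketch, not a confirmed fix for this NPU setup): recent DeepSpeed releases expose set_z3_leaf_modules, which tells ZeRO-3 to gather a module's parameters as a unit before its forward instead of hooking each parameter individually:

from deepspeed.utils import set_z3_leaf_modules

# Treat the adapter as an indivisible ZeRO-3 "leaf" so its weights are fully
# gathered (not left in their [0]-shaped partitioned state) around forward/backward.
set_z3_leaf_modules(model, [Linear_GroupDecodeAdapter])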
