Fine-tuning works in testing with the original-precision model, but the same configuration fails when fine-tuning the low-precision Int4 model? #3235

Open
thinkbig opened this issue Feb 24, 2025 · 0 comments

thinkbig commented Feb 24, 2025

Are any additional parameters required when fine-tuning a low-precision model?
The core error is:
TypeError: output tensor must have the same type as input tensor

The current configuration is as follows:
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model './models/DeepSeek-R1-Distill-Qwen-32B-Int4-W4A16' \
    --model_type deepseek_r1_distill \
    --train_type lora \
    --dataset './dataset/dftest/' \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --max_grad_norm 0.5 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 1024 \
    --packing True \
    --gradient_checkpointing True \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 0.5 \
    --optim adamw_bnb_8bit \
    --report_to none \
    --deepspeed zero3

Full stack trace:

[rank0]: Traceback (most recent call last):
[rank0]: File "/home/aicarmap/tools/swift/swift/cli/sft.py", line 5, in <module>
[rank0]: sft_main()
[rank0]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 257, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/home/aicarmap/tools/swift/swift/llm/base.py", line 46, in main
[rank0]: result = self.run()
[rank0]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 137, in run
[rank0]: return self.train(trainer)
[rank0]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 196, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/home/aicarmap/tools/swift/swift/trainers/mixin.py", line 262, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 3740, in training_step
[rank0]: self.accelerator.backward(loss, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/accelerate/accelerator.py", line 2321, in backward
[rank0]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 275, in backward
[rank0]: self.engine.step()
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2322, in step
[rank0]: self._take_model_step(lr_kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2225, in _take_model_step
[rank0]: self.optimizer.step()
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2130, in step
[rank0]: self._post_step(timer_names)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2056, in _post_step
[rank0]: self.persistent_parameters[0].all_gather(self.persistent_parameters)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1162, in all_gather
[rank0]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank0]: ret_val = func(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1530, in _all_gather
[rank0]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1840, in _allgather_params_coalesced
[rank0]: h = dist.all_gather_into_tensor(allgather_params[param_idx],
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 311, in all_gather_into_tensor
[rank0]: return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 224, in all_gather_into_tensor
[rank0]: return self.all_gather_function(output_tensor=output_tensor,
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3435, in all_gather_into_tensor
[rank0]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank0]: TypeError: output tensor must have the same type as input tensor
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/aicarmap/tools/swift/swift/cli/sft.py", line 5, in <module>
[rank1]: sft_main()
[rank1]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 257, in sft_main
[rank1]: return SwiftSft(args).main()
[rank1]: File "/home/aicarmap/tools/swift/swift/llm/base.py", line 46, in main
[rank1]: result = self.run()
[rank1]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 137, in run
[rank1]: return self.train(trainer)
[rank1]: File "/home/aicarmap/tools/swift/swift/llm/train/sft.py", line 196, in train
[rank1]: trainer.train(trainer.args.resume_from_checkpoint)
[rank1]: File "/home/aicarmap/tools/swift/swift/trainers/mixin.py", line 262, in train
[rank1]: res = super().train(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/transformers/trainer.py", line 3740, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/accelerate/accelerator.py", line 2321, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 275, in backward
[rank1]: self.engine.step()
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2322, in step
[rank1]: self._take_model_step(lr_kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2225, in _take_model_step
[rank1]: self.optimizer.step()
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2130, in step
[rank1]: self._post_step(timer_names)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2056, in _post_step
[rank1]: self.persistent_parameters[0].all_gather(self.persistent_parameters)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1162, in all_gather
[rank1]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1530, in _all_gather
[rank1]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1840, in _allgather_params_coalesced
[rank1]: h = dist.all_gather_into_tensor(allgather_params[param_idx],
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 311, in all_gather_into_tensor
[rank1]: return cdb.all_gather_into_tensor(output_tensor=output_tensor, input_tensor=tensor, group=group, async_op=async_op)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 224, in all_gather_into_tensor
[rank1]: return self.all_gather_function(output_tensor=output_tensor,
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3435, in all_gather_into_tensor
[rank1]: work = group._allgather_base(output_tensor, input_tensor, opts)
[rank1]: TypeError: output tensor must have the same type as input tensor
Train: 0%| | 0/1 [02:12<?, ?it/s]
W0224 02:11:09.274000 1721624 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1721696 closing signal SIGTERM
E0224 02:11:09.539000 1721624 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1721695) of binary: /home/aicarmap/anaconda3/envs/msswift/bin/python
Traceback (most recent call last):
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
main()
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
run(args)
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/aicarmap/anaconda3/envs/msswift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/aicarmap/tools/swift/swift/cli/sft.py FAILED
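
The trace ends in `all_gather_into_tensor`, which rejects an output tensor whose type differs from the input tensor. A possible first diagnostic step is to list which dtypes the loaded model actually contains. The following is only a sketch under assumptions: `group_params_by_dtype` and `report_mixed_dtypes` are hypothetical helper names, not part of swift or DeepSpeed, and the sketch assumes the model exposes the usual PyTorch `named_parameters()` / `named_buffers()` iterators. It does not establish that mixed dtypes are the cause here.

```python
from collections import defaultdict


def group_params_by_dtype(named_params):
    """Group names by the string form of each entry's dtype.

    `named_params` is any iterable of (name, tensor_like) pairs where every
    tensor_like has a `.dtype` attribute, e.g. `model.named_parameters()`.
    """
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.dtype)].append(name)
    return dict(groups)


def report_mixed_dtypes(named_params):
    """Return the sorted dtype names if more than one is present, else []."""
    groups = group_params_by_dtype(named_params)
    return sorted(groups) if len(groups) > 1 else []
```

With a loaded model, `report_mixed_dtypes(model.named_parameters())` returns an empty list when all parameters share one dtype. Running the same check on `model.named_buffers()` may also be worthwhile, on the assumption that some quantization formats store packed weights as buffers rather than parameters.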
