
Help: during full-parameter SFT of qwen2.5-7b-instruct, the loss suddenly drops to 0 mid-training. #6109

Open

Chtholly1 opened this issue Nov 22, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@Chtholly1

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.0
  • PyTorch version: 2.2.0+cu121 (GPU)
  • Transformers version: 4.43.4
  • Datasets version: 2.18.0
  • Accelerate version: 0.32.0
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA H800 PCIe
  • DeepSpeed version: 0.14.0
  • Bitsandbytes version: 0.43.1

Reproduction

```
NPROC_PER_NODE=4
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
  --nproc_per_node $NPROC_PER_NODE \
  --nnodes $NNODES \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  src/train.py examples/train_full/qwen2_sft.yaml
```
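The `examples/train_full/qwen2_sft.yaml` file referenced above is not included in the report. For context, a full-parameter SFT config in LLaMA-Factory typically contains fields like the following sketch (field names follow LLaMA-Factory's published example configs; the dataset name and DeepSpeed path are placeholders, not values from this issue):

```
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # placeholder path

### dataset
dataset: identity  # placeholder dataset name
template: qwen
cutoff_len: 2048

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
bf16: true
```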

(screenshot: training loss curve, dropping suddenly to zero)

Expected behavior

As shown in the figure, the loss decreases normally at first but then suddenly drops to zero. I have tried different learning rates, currently down to 1e-7, and the model still collapses during training.

Strangely, when I run SFT with LoRA the model trains stably; only full-parameter SFT collapses.

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 22, 2024
@JeremySun1224

JeremySun1224 commented Nov 25, 2024

This is most likely a numerical precision issue. In full-parameter SFT every parameter is updated, so the range of gradient values is much larger; if some gradients become too large, training turns unstable, which often shows up as grad_norm becoming NaN. LoRA only updates the inserted low-rank weights, so the gradient space is constrained and training stays stable. Try full-precision (fp32) SFT first; if that works, switch back to half precision and clip gradients to a smaller value, e.g. 1.0.
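The overflow-then-NaN mechanism described above can be sketched in a few lines of plain Python (a toy model of float16 behaviour, not LLaMA-Factory code): a gradient whose magnitude exceeds the float16 range becomes inf, the clipping coefficient derived from it collapses to 0, and inf × 0 yields NaN, which then poisons every subsequent update.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16(x: float) -> float:
    """Crude float16 model: magnitudes beyond the representable range overflow to inf."""
    return math.copysign(math.inf, x) if abs(x) > FP16_MAX else x

grad = to_fp16(7.0e4)       # an oversized gradient overflows to inf in half precision
clip_coef = 1.0 / grad      # max_norm / inf -> 0.0
clipped = grad * clip_coef  # inf * 0.0 -> nan, which propagates into loss and weights
print(clipped)              # nan
```

This is why computing in fp32 (or clipping before the overflow happens) avoids the collapse: the same gradient stays finite and the clip coefficient stays well-defined.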
