
Help: during full-parameter SFT of qwen2.5-7b-instruct, the loss suddenly drops to 0 mid-training. #6109

Open

Chtholly1 opened this issue Nov 22, 2024 · 1 comment
Labels
pending This problem is yet to be addressed

Comments

@Chtholly1

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.35
  • Python version: 3.10.0
  • PyTorch version: 2.2.0+cu121 (GPU)
  • Transformers version: 4.43.4
  • Datasets version: 2.18.0
  • Accelerate version: 0.32.0
  • PEFT version: 0.12.0
  • TRL version: 0.8.6
  • GPU type: NVIDIA H800 PCIe
  • DeepSpeed version: 0.14.0
  • Bitsandbytes version: 0.43.1

Reproduction

```
NPROC_PER_NODE=4
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun \
  --nproc_per_node $NPROC_PER_NODE \
  --nnodes $NNODES \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  src/train.py examples/train_full/qwen2_sft.yaml
```
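The `examples/train_full/qwen2_sft.yaml` file referenced above is not included in the report. For context, a full-parameter SFT config in LLaMA-Factory typically contains fields like the following sketch (field names follow LLaMA-Factory's published example configs; the dataset name and DeepSpeed path are placeholders, not values from this issue):

```
### model
model_name_or_path: Qwen/Qwen2.5-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json  # placeholder path

### dataset
dataset: identity  # placeholder dataset name
template: qwen
cutoff_len: 2048

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
bf16: true
```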

(screenshot: training loss curve, dropping suddenly to zero)

Expected behavior

As shown in the figure, the loss decreases normally at first but then suddenly drops to zero. I have tried different learning rates, currently down to 1e-7, and the model still collapses during training.

Strangely, when I run SFT with LoRA the model trains stably; only full-parameter SFT collapses.

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 22, 2024
@JeremySun1224

JeremySun1224 commented Nov 25, 2024

This is most likely a numerical precision issue. In full-parameter SFT every parameter is updated, so the range of gradient values is much larger; if some gradients become too large, training turns unstable, which often shows up as grad_norm becoming NaN. LoRA only updates the inserted low-rank weights, so the gradient space is constrained and training stays stable. Try full-precision (fp32) SFT first; if that works, switch back to half precision and clip gradients to a smaller value, e.g. 1.0.
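The overflow-then-NaN mechanism described above can be sketched in a few lines of plain Python (a toy model of float16 behaviour, not LLaMA-Factory code): a gradient whose magnitude exceeds the float16 range becomes inf, the clipping coefficient derived from it collapses to 0, and inf × 0 yields NaN, which then poisons every subsequent update.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value

def to_fp16(x: float) -> float:
    """Crude float16 model: magnitudes beyond the representable range overflow to inf."""
    return math.copysign(math.inf, x) if abs(x) > FP16_MAX else x

grad = to_fp16(7.0e4)       # an oversized gradient overflows to inf in half precision
clip_coef = 1.0 / grad      # max_norm / inf -> 0.0
clipped = grad * clip_coef  # inf * 0.0 -> nan, which propagates into loss and weights
print(clipped)              # nan
```

This is why computing in fp32 (or clipping before the overflow happens) avoids the collapse: the same gradient stays finite and the clip coefficient stays well-defined.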
