System Info
llamafactory version: 0.9.1.dev0

Reproduction
```
NPROC_PER_NODE=4 NNODES=1 RANK=0 MASTER_ADDR=127.0.0.1 MASTER_PORT=29500
CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node $NPROC_PER_NODE --nnodes $NNODES --node_rank $RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT src/train.py examples/train_full/qwen2_sft.yaml
```
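The config file named in the command is not included in the report. For orientation only, here is a minimal sketch of what a LLaMA-Factory full-parameter SFT config such as examples/train_full/qwen2_sft.yaml typically looks like; the model path, dataset name, and hyperparameter values below are illustrative assumptions, not the reporter's actual settings.

```
# Illustrative sketch only -- not the reporter's actual examples/train_full/qwen2_sft.yaml
### model
model_name_or_path: Qwen/Qwen2-7B-Instruct   # assumed model; the real path may differ

### method
stage: sft
do_train: true
finetuning_type: full                        # full-parameter SFT (as opposed to LoRA)

### dataset
dataset: alpaca_en_demo                      # placeholder dataset name
template: qwen
cutoff_len: 2048

### output
output_dir: saves/qwen2/full/sft
logging_steps: 10
save_steps: 500

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true                                   # half precision; relevant to the instability discussed below
```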
Expected behavior
As shown in the figure, the loss decreases normally at first but then suddenly drops to zero. I have tried learning rates of different sizes, currently down to 1e-7, and the model still collapses during training.
What is strange is that when I run SFT with the LoRA method, the model trains stably, but full-parameter SFT collapses.
This is most likely a numerical precision issue. Full-parameter SFT updates every weight, so the range of gradient values is much wider; if the gradients of some parameters become too large, training easily turns unstable, which can show up as grad_norm becoming nan. LoRA only updates the low-rank adapter weights, so the gradient space is constrained and training stays stable. Try single-precision (fp32) SFT first; if that runs fine, switch back to half precision and clip gradients to a smaller value, e.g. 1.0.
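To make the suggestion concrete, the changes map onto the training config roughly as follows. This is a sketch under the assumption that the YAML passes the standard HuggingFace TrainingArguments fields (bf16, fp16, max_grad_norm) through; the values are illustrative, not a verified fix.

```
# Step 1: rule out precision problems by running full SFT in fp32 (disable mixed precision).
bf16: false
fp16: false

# Step 2: if fp32 trains stably, switch back to half precision with gradient clipping.
# bf16: true
# max_grad_norm: 1.0   # clip the global gradient norm; lower it further if spikes persist
```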