System Info

llamafactory version: 0.9.1.dev0

Reproduction

Config file:
```yaml
### model
model_name_or_path: /home/xxx/.cache/modelscope/hub/qwen/Qwen2___5-0___5B

### method
template: qwen
stage: pt
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: mydataset
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: saves/Qwen2.5-0.5B/lora/pretrain
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
learning_rate: 1.0e-5
num_train_epochs: 10
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
```
Problem

1. Sample counts are inconsistent across different preprocessing_num_workers values

I set breakpoints in run_pt in src/llamafactory/train/pt/workflow.py and tried preprocessing_num_workers values of 8, 16, and 32; the resulting sample counts were 603, 600, and 593 respectively.
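For context on why the count can change with the worker count, here is a minimal sketch (the count_blocks helper and the synthetic lengths are hypothetical, not LLaMA-Factory's actual code): in block-packing pretraining preprocessing, token sequences are concatenated and split into cutoff_len-sized blocks independently per worker shard (and per map batch within a shard), and every chunk's tail shorter than cutoff_len is dropped, so more workers means more chunk boundaries and more dropped tails.

```python
import random

def count_blocks(token_lengths, cutoff_len, num_shards):
    """Pack documents into fixed-size blocks, shard by shard (hypothetical helper)."""
    shard_size = max(len(token_lengths) // num_shards, 1)
    total = 0
    for i in range(0, len(token_lengths), shard_size):
        shard_tokens = sum(token_lengths[i:i + shard_size])
        total += shard_tokens // cutoff_len  # each shard's tail shorter than cutoff_len is dropped
    return total

random.seed(0)
lengths = [random.randint(200, 4000) for _ in range(5000)]  # synthetic document lengths
for n in (8, 16, 32):
    print(n, count_blocks(lengths, cutoff_len=16384, num_shards=n))
# Expect the totals to drift downward as num_shards grows,
# mirroring the observed 603 / 600 / 593.
```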
2. Sample count, training steps, and checkpoints do not line up

In the training log, the sample count is 605, num_train_epochs is 10, and total_batch_size is 64. My understanding is that the total number of training steps should be 605 × 10 / 64 = 94.53, i.e. 95 steps in total; or, counting each epoch as 605 / 64 = 9.45 rounded up to 10 steps, 10 × 10 = 100 steps in total. But the log shows only 90 steps.
Because save_strategy is set to epoch, the steps at which checkpoints are saved are shifted as well; the checkpoints of the last two epochs are only 2 steps apart.
Expected behavior

No response

Others
I hope the authors can help clarify the questions above. Thank you!
It should be (605 // 64) × 10 = 90, i.e. 9 steps per epoch. The last two checkpoints are only 2 steps apart because a checkpoint is always saved once training finishes.
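A minimal sketch of that arithmetic under the posted config (assuming a single GPU, so total_batch_size = 4 × 16 = 64, and the floor-division convention described above):

```python
# Step count under the posted config (single-GPU assumption).
num_samples = 605
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_train_epochs = 10

total_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 64
updates_per_epoch = num_samples // total_batch_size  # 605 // 64 = 9 (floor, not rounding)
total_steps = updates_per_epoch * num_train_epochs   # 9 * 10 = 90
print(updates_per_epoch, total_steps)  # -> 9 90
```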
@hiyouga Following your reasoning, my understanding is that the number of steps per epoch is truncated by the // operation, so the data falling into the fractional remainder is dropped and never trained on. Could you confirm whether that understanding is correct? Thank you!
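One detail worth separating, as a minimal PyTorch sketch (it shows only what the DataLoader yields, not how the Trainer counts updates): with the default drop_last=False, the loader still yields the final partial batch, so the floor division above concerns how many optimizer updates are counted per epoch rather than the loader silently discarding samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(605))
dl = DataLoader(ds, batch_size=4, drop_last=False)  # drop_last defaults to False

print(len(dl))                        # 152 == ceil(605 / 4): the partial batch is kept
print(sum(b[0].numel() for b in dl))  # 605: every sample is yielded each epoch
```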