Reminder
System Info
While training a 7B model, the estimated training time at the start was about 30 days. After roughly half a day of training, the speed dropped sharply, and the estimate is now about 50 days.
However, about 10 GB of GPU memory is still unused. What could be causing the training slowdown? Could it be related to my batch size being set relatively large?
Also, which parameter controls the loss value that gets printed? Is it logging_steps?
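(For reference on the second question, here is a minimal illustrative sketch in Python, not LLaMA-Factory's actual code: `logging_steps` only controls how often the running training loss is averaged and reported; it does not change how the loss itself is computed.)

```python
# Minimal illustrative sketch (not LLaMA-Factory code): logging_steps only sets
# the reporting interval; the printed "loss" is the average over that interval.
logging_steps = 100                      # value from the config below
step_losses = [1.0, 0.9, 0.8] * 100      # placeholder per-step losses for illustration

running_loss, steps_since_log = 0.0, 0
for step, batch_loss in enumerate(step_losses, start=1):
    running_loss += batch_loss
    steps_since_log += 1
    if step % logging_steps == 0:
        # Report the average loss over the last logging_steps steps, then reset.
        print(f"step {step}: loss = {running_loss / steps_since_log:.4f}")
        running_loss, steps_since_log = 0.0, 0
```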
Reproduction
model
model_name_or_path: /mnt/sde/shixing/models/Qwen2.5-7B-Instruct
method
stage: sft
do_train: true
finetuning_type: lora
deepspeed: /mnt/sde/shixing/LLaMA-Factory/examples/deepspeed/ds_z0_config.json
dataset
dataset: policy_data_all
template: qwen
cutoff_len: 2414
overwrite_cache: true
preprocessing_num_workers: 32
max_samples: 5000
output
output_dir: /mnt/sde/shixing/LLaMA-Factory/saves/qwen2.5-7B-complete-data
logging_steps: 100
save_steps: 0.05
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.1
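For context on the batch-size question, a back-of-the-envelope calculation of the effective batch size implied by the settings above (the GPU count is an assumption, since the issue does not state it):

```python
# Back-of-the-envelope arithmetic for the effective batch size implied by the
# config above. num_gpus is an assumption; the issue does not say how many
# GPUs/processes are used.
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_gpus = 1                       # assumed; scale accordingly for multi-GPU
cutoff_len = 2414                  # maximum tokens per sample

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
max_tokens_per_step = effective_batch_size * cutoff_len

print(effective_batch_size)   # 64 samples per optimizer step (under the assumption above)
print(max_tokens_per_step)    # up to 154,496 tokens per optimizer step
```

Whether 64 samples per optimizer step is "large" depends on the actual GPU count and on the real token lengths in policy_data_all; the token figure above is an upper bound based on cutoff_len.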
Expected behavior
No response
Others
No response