Reminder
System Info
While training a 7B model, the estimated training time at the start was about 30 days. After roughly half a day of training, the speed dropped sharply, and the estimate is now about 50 days.
However, about 10 GB of GPU memory is still unused. What could be causing the training slowdown? Could it be related to my batch size being set relatively large?
Also, which parameter controls the loss value that gets printed? Is it logging_steps?
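(For reference on the second question, here is a minimal illustrative sketch in Python, not LLaMA-Factory's actual code: `logging_steps` only controls how often the running training loss is averaged and reported; it does not change how the loss itself is computed.)

```python
# Minimal illustrative sketch (not LLaMA-Factory code): logging_steps only sets
# the reporting interval; the printed "loss" is the average over that interval.
logging_steps = 100                      # value from the config below
step_losses = [1.0, 0.9, 0.8] * 100      # placeholder per-step losses for illustration

running_loss, steps_since_log = 0.0, 0
for step, batch_loss in enumerate(step_losses, start=1):
    running_loss += batch_loss
    steps_since_log += 1
    if step % logging_steps == 0:
        # Report the average loss over the last logging_steps steps, then reset.
        print(f"step {step}: loss = {running_loss / steps_since_log:.4f}")
        running_loss, steps_since_log = 0.0, 0
```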
Reproduction
model
model_name_or_path: /mnt/sde/shixing/models/Qwen2.5-7B-Instruct
method
stage: sft
do_train: true
finetuning_type: lora
deepspeed: /mnt/sde/shixing/LLaMA-Factory/examples/deepspeed/ds_z0_config.json
dataset
dataset: policy_data_all
template: qwen
cutoff_len: 2414
overwrite_cache: true
preprocessing_num_workers: 32
max_samples: 5000
output
output_dir: /mnt/sde/shixing/LLaMA-Factory/saves/qwen2.5-7B-complete-data
logging_steps: 100
save_steps: 0.05
plot_loss: true
overwrite_output_dir: true
train
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
gradient_checkpointing: true
learning_rate: 1.0e-5
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
eval
val_size: 0.05
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.1
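For context on the batch-size question, a back-of-the-envelope calculation of the effective batch size implied by the settings above (the GPU count is an assumption, since the issue does not state it):

```python
# Back-of-the-envelope arithmetic for the effective batch size implied by the
# config above. num_gpus is an assumption; the issue does not say how many
# GPUs/processes are used.
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_gpus = 1                       # assumed; scale accordingly for multi-GPU
cutoff_len = 2414                  # maximum tokens per sample

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
max_tokens_per_step = effective_batch_size * cutoff_len

print(effective_batch_size)   # 64 samples per optimizer step (under the assumption above)
print(max_tokens_per_step)    # up to 154,496 tokens per optimizer step
```

Whether 64 samples per optimizer step is "large" depends on the actual GPU count and on the real token lengths in policy_data_all; the token figure above is an upper bound based on cutoff_len.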
Expected behavior
No response
Others
No response