
Inconsistency between the number of samples and the number of training steps in pt mode #6133

Open
1 task done
kascas opened this issue Nov 25, 2024 · 2 comments
Labels
pending This problem is yet to be addressed

Comments


kascas commented Nov 25, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.4.0-152-generic-x86_64-with-glibc2.35
  • Python version: 3.9.19
  • PyTorch version: 2.3.1+cu121 (GPU)
  • Transformers version: 4.44.0
  • Datasets version: 2.20.0
  • Accelerate version: 1.0.1
  • PEFT version: 0.11.1
  • TRL version: 0.9.4
  • GPU type: NVIDIA GeForce RTX 4090
  • DeepSpeed version: 0.14.4
  • Bitsandbytes version: 0.43.1
  • vLLM version: 0.5.3.post1

Reproduction

Configuration file

### model
model_name_or_path: /home/xxx/.cache/modelscope/hub/qwen/Qwen2___5-0___5B

### method
template: qwen
stage: pt
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: mydataset
cutoff_len: 16384
overwrite_cache: true
preprocessing_num_workers: 8

### output
output_dir: saves/Qwen2.5-0.5B/lora/pretrain
logging_steps: 10
save_strategy: epoch
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 4
gradient_accumulation_steps: 16
learning_rate: 1.0e-5
num_train_epochs: 10
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

Questions

1. The number of samples differs across different preprocessing_num_workers values

I set a breakpoint in run_pt in src/llamafactory/train/pt/workflow.py and tried three settings of preprocessing_num_workers (8, 16, and 32); the resulting sample counts were 603, 600, and 593 respectively.

(screenshots of the processed dataset sizes under preprocessing_num_workers = 8 / 16 / 32)
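For context, a plausible explanation (an assumption, not confirmed against the LLaMA-Factory source): in pt mode the tokenized texts are packed into fixed-length blocks of cutoff_len, and the standard Hugging Face group_texts recipe drops the trailing partial block of each map batch. Since preprocessing_num_workers changes how the dataset is sharded across workers, the amount of dropped remainder changes too, which would produce small differences like 603/600/593. A minimal sketch of that packing behavior:

```python
# Minimal sketch of the standard "group_texts" packing recipe (assumption,
# not the LLaMA-Factory source): token ids are concatenated per map batch
# and split into fixed-size blocks; the trailing partial block is dropped.
from itertools import chain

def group_texts(examples, block_size=16384):  # block_size mirrors cutoff_len
    # Concatenate all token ids in this map batch into one long sequence.
    concatenated = list(chain(*examples["input_ids"]))
    # Keep only whole blocks; the remainder at the end is discarded.
    total_length = (len(concatenated) // block_size) * block_size
    return {
        "input_ids": [
            concatenated[i : i + block_size]
            for i in range(0, total_length, block_size)
        ]
    }

# dataset = dataset.map(group_texts, batched=True,
#                       num_proc=preprocessing_num_workers)
```

Because the discarded remainder occurs once per shard/batch boundary, more workers can mean more (or differently placed) boundaries and hence a slightly different total block count.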

2. The sample count, training steps, and checkpoints do not match up

In the log below, the sample count is 605, the number of epochs is 10, and total_batch_size is 64. My understanding is that the number of training steps should normally be 605 × 10 / 64 = 94.53, so there should be 95 steps in total; or, rounding 605 / 64 = 9.45 up to 10 steps per epoch, 10 × 10 = 100 steps in total. However, the log shows only 90 steps.

(screenshot of the trainer log)

Since save_strategy is set to epoch, the steps at which checkpoints are saved are also shifted, and the last two epochs are only 2 steps apart.

(screenshot of the saved checkpoint directories)

Expected behavior

No response

Others

I hope the author can help clarify the questions above. Thank you!

github-actions bot added the pending label (This problem is yet to be addressed) on Nov 25, 2024

hiyouga (Owner) commented Nov 25, 2024

It should be (605 // 64) * 10 = 90, i.e. 9 steps per epoch. The last two checkpoints are only a couple of steps apart because a checkpoint is always saved at the end of training.
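For reference, a worked version of that arithmetic (a sketch using the values from the config and log above; it assumes the trainer floors the per-epoch optimizer-step count, so a trailing partial batch does not produce an extra step):

```python
# Values taken from the config and the reported log (single RTX 4090 assumed).
num_samples = 605
per_device_train_batch_size = 4
gradient_accumulation_steps = 16
num_gpus = 1
num_train_epochs = 10

# Effective batch per optimizer step: 4 * 16 * 1 = 64.
total_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

steps_per_epoch = num_samples // total_batch_size   # 605 // 64 = 9
max_steps = steps_per_epoch * num_train_epochs      # 9 * 10 = 90
print(total_batch_size, steps_per_epoch, max_steps)  # 64 9 90
```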


kascas (Author) commented Nov 25, 2024

It should be (605 // 64) * 10 = 90, i.e. 9 steps per epoch. The last two checkpoints are only a couple of steps apart because a checkpoint is always saved at the end of training.

@hiyouga Following your reasoning, my understanding is that the per-epoch step count is truncated by the // (floor division), so the data corresponding to the fractional part is dropped and never trained on.

So I would like to ask:

  • Is there an issue with the step counting of the checkpoints? If one epoch corresponds to 9 steps, the checkpoints should be saved at steps 9/18/27/.../81/90, not 9/19/29/.../88/90.
  • Is there a way to avoid the data loss caused by the truncation in (605 // 64)?

Thanks!
