Fix total_num_steps #1566

Merged
merged 3 commits into main from patch-1
May 15, 2024

Conversation

@bofenghuang (Contributor) commented Apr 24, 2024

Hi @winglian 👋,

Thank you for your excellent work!

I'm not entirely sure if I understand the calculate_total_num_steps function correctly.

Considering that cfg.batch_size already represents the effective batch size (per_device_batch_size * gradient_accumulation_steps * world_size) at this point, as seen in the following snippet of the normalize_config function, it seems unnecessary to divide by world_size again when calculating total_num_steps.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/68601ec6ad1cc0e8cb855376586e6eef6a8aa270/src/axolotl/utils/config/__init__.py#L73-L75

If confirmed, this implies that the model is trained on only 1/N of the desired steps when utilizing N GPUs with the current version.
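To illustrate the double division described above, here is a minimal sketch with made-up numbers; the variable names mirror the config fields, but this is illustrative code, not axolotl's implementation:

```python
# Hypothetical numbers; names mirror the axolotl config fields but this
# sketch is not the project's actual code.
per_device_batch_size = 4
gradient_accumulation_steps = 8
world_size = 2  # N GPUs

# As in normalize_config, batch_size is already the *effective* batch size:
batch_size = per_device_batch_size * gradient_accumulation_steps * world_size  # 64

num_samples = 6400

# Buggy: dividing by world_size again undercounts steps by a factor of N.
buggy_total_num_steps = num_samples // (batch_size * world_size)  # 50

# Fixed: batch_size already accounts for world_size, so divide only once.
fixed_total_num_steps = num_samples // batch_size  # 100

print(buggy_total_num_steps, fixed_total_num_steps)
```

With N = 2 GPUs, the buggy version reports half the intended number of steps, matching the 1/N undercount described above.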

@bofenghuang (Contributor, Author) commented Apr 24, 2024

Also, in my opinion, len(data_loader), which represents the number of micro-batches (of size per_device_batch_size), should be divided only by world_size and gradient_accumulation_steps to obtain the number of steps per epoch.
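As a rough sketch of that relationship, assuming an unsharded dataloader that yields micro-batches of per_device_batch_size (illustrative numbers only, not axolotl code):

```python
# Illustrative sketch: steps per epoch from a micro-batch count, assuming
# an unsharded dataloader of per_device_batch_size micro-batches.
num_samples = 6400
per_device_batch_size = 4
gradient_accumulation_steps = 8
world_size = 2

# Stand-in for len(data_loader): the number of micro-batches.
num_micro_batches = num_samples // per_device_batch_size  # 1600

# Each optimizer step consumes world_size * gradient_accumulation_steps
# micro-batches, so divide by exactly those two factors.
steps_per_epoch = num_micro_batches // (world_size * gradient_accumulation_steps)  # 100

print(steps_per_epoch)
```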

@winglian (Collaborator)
Thanks @bofenghuang, I believe you are right. Let me do some additional testing, and we'll get this merged once the linting is fixed.

@bofenghuang (Contributor, Author)
Thank you for your response @winglian !

I later discovered that this doesn't affect the final training steps, since we pass num_epochs to the trainer and set max_steps to -1 instead.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/5294653a2d353066600cbc66bb06f7c63c87147b/src/axolotl/core/trainer_builder.py#L1162-L1164

However, this might affect the logic of the warmup steps below.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/5294653a2d353066600cbc66bb06f7c63c87147b/src/axolotl/core/trainer_builder.py#L997-L1003
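A common pattern here, sketched with a hypothetical helper (this is not the exact axolotl code at the linked lines), is to fall back to a fraction of total_num_steps when warmup is not configured explicitly, so an undercounted total shrinks the warmup proportionally:

```python
# Hypothetical sketch of why an undercounted total_num_steps matters:
# trainers often derive warmup from the total when it isn't set explicitly.
# `resolve_warmup_steps` and `ratio` are illustrative names, not axolotl's API.

def resolve_warmup_steps(configured_warmup, total_num_steps, ratio=0.1):
    """Use the configured warmup if given, else a fraction of total steps."""
    if configured_warmup is not None:
        return configured_warmup
    return int(total_num_steps * ratio)

# If total_num_steps is undercounted by world_size (e.g. 50 instead of 100),
# the derived warmup shrinks by the same factor:
print(resolve_warmup_steps(None, 50), resolve_warmup_steps(None, 100))  # 5 10
```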

@winglian (Collaborator) commented May 3, 2024

> I later discovered that this doesn't affect the final training steps, since we pass num_epochs to the trainer and set max_steps to -1 instead.

Yeah, even with your fix, I noticed the calculation is still off by a few steps (which can be due to many factors, especially when using multipack).

@winglian winglian merged commit 81da7d2 into axolotl-ai-cloud:main May 15, 2024
7 checks passed
@bofenghuang bofenghuang deleted the patch-1 branch December 3, 2024 22:30
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* Fix `total_num_steps`

* Fix total_num_steps

* lint