Fix total_num_steps #1566

Merged
merged 3 commits into main from patch-1
May 15, 2024

Conversation

@bofenghuang (Contributor) commented Apr 24, 2024

Hi @winglian 👋,

Thank you for your excellent work!

I'm not entirely sure if I understand the calculate_total_num_steps function correctly.

Considering that cfg.batch_size already represents the effective batch size (per_device_batch_size * gradient_accumulation_steps * world_size) at this point, as seen in the following snippet of the normalize_config function, it seems unnecessary to divide by world_size again when calculating total_num_steps.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/68601ec6ad1cc0e8cb855376586e6eef6a8aa270/src/axolotl/utils/config/__init__.py#L73-L75

If confirmed, this implies that the model is trained on only 1/N of the desired steps when utilizing N GPUs with the current version.
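To illustrate the double division described above, here is a minimal sketch with made-up numbers; the variable names mirror the config fields, but this is illustrative code, not axolotl's implementation:

```python
# Hypothetical numbers; names mirror the axolotl config fields but this
# sketch is not the project's actual code.
per_device_batch_size = 4
gradient_accumulation_steps = 8
world_size = 2  # N GPUs

# As in normalize_config, batch_size is already the *effective* batch size:
batch_size = per_device_batch_size * gradient_accumulation_steps * world_size  # 64

num_samples = 6400

# Buggy: dividing by world_size again undercounts steps by a factor of N.
buggy_total_num_steps = num_samples // (batch_size * world_size)  # 50

# Fixed: batch_size already accounts for world_size, so divide only once.
fixed_total_num_steps = num_samples // batch_size  # 100

print(buggy_total_num_steps, fixed_total_num_steps)
```

With N = 2 GPUs, the buggy version reports half the intended number of steps, matching the 1/N undercount described above.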

@bofenghuang (Contributor, Author) commented Apr 24, 2024

Also, in my opinion, len(data_loader), which represents the number of micro-batches (of size per_device_batch_size), should be divided only by world_size and gradient_accumulation_steps to obtain the number of steps per epoch.
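As a rough sketch of that relationship, assuming an unsharded dataloader that yields micro-batches of per_device_batch_size (illustrative numbers only, not axolotl code):

```python
# Illustrative sketch: steps per epoch from a micro-batch count, assuming
# an unsharded dataloader of per_device_batch_size micro-batches.
num_samples = 6400
per_device_batch_size = 4
gradient_accumulation_steps = 8
world_size = 2

# Stand-in for len(data_loader): the number of micro-batches.
num_micro_batches = num_samples // per_device_batch_size  # 1600

# Each optimizer step consumes world_size * gradient_accumulation_steps
# micro-batches, so divide by exactly those two factors.
steps_per_epoch = num_micro_batches // (world_size * gradient_accumulation_steps)  # 100

print(steps_per_epoch)
```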

@winglian (Collaborator)
Thanks @bofenghuang, I believe you are right. Let me do some additional testing, and we'll get this merged once the linting is fixed.

@bofenghuang (Contributor, Author)
Thank you for your response @winglian !

I later discovered that this doesn't affect the final training steps, since we pass num_epochs to the trainer and set max_steps to -1 instead.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/5294653a2d353066600cbc66bb06f7c63c87147b/src/axolotl/core/trainer_builder.py#L1162-L1164

However, this might affect the logic of the warmup steps below.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/5294653a2d353066600cbc66bb06f7c63c87147b/src/axolotl/core/trainer_builder.py#L997-L1003
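A common pattern here, sketched with a hypothetical helper (this is not the exact axolotl code at the linked lines), is to fall back to a fraction of total_num_steps when warmup is not configured explicitly, so an undercounted total shrinks the warmup proportionally:

```python
# Hypothetical sketch of why an undercounted total_num_steps matters:
# trainers often derive warmup from the total when it isn't set explicitly.
# `resolve_warmup_steps` and `ratio` are illustrative names, not axolotl's API.

def resolve_warmup_steps(configured_warmup, total_num_steps, ratio=0.1):
    """Use the configured warmup if given, else a fraction of total steps."""
    if configured_warmup is not None:
        return configured_warmup
    return int(total_num_steps * ratio)

# If total_num_steps is undercounted by world_size (e.g. 50 instead of 100),
# the derived warmup shrinks by the same factor:
print(resolve_warmup_steps(None, 50), resolve_warmup_steps(None, 100))  # 5 10
```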

@winglian (Collaborator) commented May 3, 2024

> I later discovered that this doesn't affect the final training steps, since we pass num_epochs to the trainer and set max_steps to -1 instead.

Yeah, even with your fix, I noticed the calculation is still off by a few steps (which can be due to many factors, especially when using multipack).

@winglian winglian merged commit 81da7d2 into axolotl-ai-cloud:main May 15, 2024
7 checks passed
@bofenghuang bofenghuang deleted the patch-1 branch December 3, 2024 22:30
djsaunde pushed a commit that referenced this pull request Dec 17, 2024
* Fix `total_num_steps`

* Fix total_num_steps

* lint