[PyTorch] Support dtype casting in fused adam #977
Conversation
Force-pushed aa11601 to 4277bd1
Force-pushed fd68cdd to f65a320
Force-pushed f6f7c49 to b4c90a8
@timmoon10 Could you please take a look?

/te-ci pytorch

Hi @timmoon10, I encountered an issue when trying to update scale_inv inside the Adam kernel using …

/te-ci pytorch
transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu (Outdated)
I notice now that this file uses unittest, while the CI infrastructure uses pytest:

pytest -v -s $TE_PATH/tests/pytorch/test_fused_optimizer.py
It may be better to fix that in a separate PR.
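For context, pytest can collect unittest.TestCase subclasses as-is, so the command above still exercises the existing unittest-style tests even before that cleanup. A minimal sketch of why that works; the class and test names below are illustrative, not the actual TE test suite:

import unittest
import torch

class TestAdamState(unittest.TestCase):
    # pytest collects unittest.TestCase subclasses directly, so
    # "pytest -v -s test_fused_optimizer.py" would pick this test up too.
    def test_exp_avg_dtype(self):
        param = torch.nn.Parameter(torch.randn(4))
        param.grad = torch.randn_like(param)
        opt = torch.optim.Adam([param])
        opt.step()
        self.assertEqual(opt.state[param]["exp_avg"].dtype, torch.float32)

if __name__ == "__main__":
    unittest.main()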
Force-pushed 4c2d42e to 47a448b
transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu (Outdated)
Force-pushed c47909c to 5ae1573
/te-ci pytorch
Based on a discussion with @ptrendx, I think we should give more thought to the API. While this is primarily targeting Megatron-LM, it's important that other TE users can use it easily without relying on Mcore infrastructure. @ptrendx's preferred API is for the optimizer to hold the model weights (including …):

model = MyModel()  # Mix of fp32, bf16, fp8 params
optim = FusedAdam(model.parameters(), dtype=torch.float32)  # Create FP32 master weights for each non-fp32 param
optim.step()
# optim.state[bf16_param]["exp_avg"] is fp32 tensor
# optim.state[bf16_param]["exp_avg_sq"] is fp32 tensor
# optim.state[bf16_param]["master_param"] is fp32 tensor
# optim.state[fp32_param]["master_param"] is None

This API is more natural for standard PyTorch workflows and it doesn't require maintaining separate model weights/master weights like in Megatron-LM. That said, I can see value in keeping …:

model = MyModel()  # Mix of fp32, bf16, fp8 params
master_weights = [param.float() for param in model.parameters()]
optim = FusedAdam(model.parameters(), dtype=torch.float32, master_weights=master_weights)
# optim.state[param]["master_param"] is from my_master_weights
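For readers outside this thread, here is a minimal sketch in plain PyTorch of the master-weight bookkeeping the proposed API implies. It uses torch.optim.Adam with hand-written casts, whereas FusedAdam fuses the cast into the optimizer kernel itself; the helper names (build_master_params, step_with_master_weights) are illustrative only:

import torch

def build_master_params(params, dtype=torch.float32):
    # One fp32 master copy per non-fp32 param, None otherwise,
    # mirroring the optim.state[param]["master_param"] layout described above.
    return [p.detach().clone().to(dtype) if p.dtype != dtype else None for p in params]

model = torch.nn.Linear(8, 8).bfloat16()   # stand-in for "MyModel" with bf16 params
params = list(model.parameters())
masters = build_master_params(params)

# Optimizer state (exp_avg, exp_avg_sq) lives in fp32 on the master copies.
opt = torch.optim.Adam([m for m in masters if m is not None], lr=1e-3)

def step_with_master_weights():
    # Gradients computed on the bf16 params drive an fp32 update on the masters,
    # then the updated masters are cast back into the model params.
    for p, m in zip(params, masters):
        if m is not None:
            m.grad = p.grad.to(m.dtype)
    opt.step()
    with torch.no_grad():
        for p, m in zip(params, masters):
            if m is not None:
                p.copy_(m.to(p.dtype))

loss = model(torch.randn(2, 8, dtype=torch.bfloat16)).float().sum()
loss.backward()
step_with_master_weights()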
Force-pushed 541e6e7 to ae375cd
Hi @timmoon10, I have made modifications to the FusedAdam API based on your suggestions. I already tested my changes in Megatron-LM, and the training loss matches the previous results exactly.

# create optimizer
optimizer = FusedAdam(param_groups, ..., master_weights=None)
optimizer = DistributedOptimizer(optimizer, *other_args)

# inside __init__ of dist opt
master_weights = list(itertools.chain(*self.shard_fp32_from_float16_groups))
self.optimizer.master_weights = master_weights  # self.optimizer is FusedAdam

This usage is somewhat uncomfortable, but not entirely unusual. Any suggestions?
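As a toy illustration of the deferred wiring described above (class and attribute names other than FusedAdam's master_weights are made up, not Megatron-LM's real ones): the wrapper only knows its fp32 copies after it has grouped the params, so it attaches them to the already-constructed inner optimizer.

import itertools

class ToyDistOptimizer:
    def __init__(self, optimizer, float16_groups):
        self.optimizer = optimizer
        # Build one fp32 copy per fp16 param, grouped the same way as the model params.
        self.shard_fp32_from_float16_groups = [
            [p.detach().float() for p in group] for group in float16_groups
        ]
        # The step called "somewhat uncomfortable" above: mutate the inner
        # optimizer after construction instead of passing master_weights up front.
        self.optimizer.master_weights = list(
            itertools.chain(*self.shard_fp32_from_float16_groups)
        )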
Force-pushed ccd54cd to 267c90a
@timmoon10 Could you please take a look?
Force-pushed 267c90a to 44dca61
LGTM. Thanks for implementing all the API changes; this is much cleaner and easier to reason about. I think there are still some things that could be improved (options to construct master weights internally, cleaning up how to specify master weights, mixed FP16/BF16, fixing the tests), but those are internal changes that can be worked on later.
/te-ci pytorch
/te-ci pytorch
* support dtype casting fusion in FusedAdam
* minor changes
* fix lint
* changes based on review comments
* remove unused code
* code refactor
* fix typo
* refactor
* remove unused code
* Fix linter warnings
* Copy CUDA headers for framework sdists

Signed-off-by: Shijie Wang <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: beinggod <[email protected]>
Description
Currently, FusedAdam updates the params in place.
This PR adds dtype casting to the FusedAdam kernel: in addition to updating the master params in place, it can also update extra model params, which may be of bf16, fp16, or fp8 type.
Update:
I have validated convergence with GPT training in Megatron-LM. The losses before and after enabling this feature are bit-wise identical.
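A hedged usage sketch of the feature as described in this PR and the review thread above; the import path and keyword names are assumptions that may differ across Transformer Engine versions, and the fused kernel requires a CUDA device:

import torch
from transformer_engine.pytorch.optimizers import FusedAdam  # import path is an assumption; check your TE version

model = torch.nn.Linear(16, 16).bfloat16().cuda()   # bf16 model params
params = list(model.parameters())

# fp32 master copies held outside the model, one per param, as in the thread above.
master_weights = [p.detach().clone().float() for p in params]

optim = FusedAdam(params, lr=1e-3, master_weights=master_weights)

out = model(torch.randn(4, 16, dtype=torch.bfloat16, device="cuda"))
out.float().sum().backward()
optim.step()  # fused kernel updates the fp32 masters and casts the results back into the bf16 params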
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: