
[PyTorch] FP8 MHA with RoPE and Miscellaneous Improvements #1100

Merged
merged 24 commits into NVIDIA:main from xiny/fp8_mha_with_rope on Sep 5, 2024

Conversation

@yaox12 (Collaborator) commented on Aug 13, 2024

Description

  1. FP8 MHA with RoPE. Handle Float8Tensor inputs in DotProductAttention based on their dtype instead of the fp8_mha flag; fp8_mha still ensures that the output of DPA is in FP8 (see the sketch after this list). With this PR:
    • FP8 DPA workflow (unchanged): LayerNormLinear -> DPA (cast BF16 input to FP8, FP8 DPA, cast output to BF16) -> Linear
    • FP8 MHA workflow:
      • Without RoPE (unchanged): LayerNormLinear (output in FP8) -> DPA (FP8 DPA) -> Linear (FP8 input)
      • With RoPE (new): LayerNormLinear (output in BF16) -> apply RoPE (output in BF16) -> DPA (cast BF16 input to FP8, FP8 DPA) -> Linear (FP8 input)
  2. Rename is_first_module_in_mha to fp8_output and add this flag to LayerNormLinear; otherwise, even the LayerNormLinear in the MLP (after MHA) would produce FP8 outputs when fp8_mha=True.
  3. Avoid index_select ops in cast_to_fp8.
  4. Avoid index_select ops in FP8 DPA. Only the forward functions are modified, because the CPU overheads are not exposed in backward.
  5. Change the way we check the strides of k and v to avoid creating PyTorch tensors.
  6. Move the transpose to backward for Float8Tensor inputs in Linear.
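
To make the dtype-based handling in item 1 concrete, here is a minimal, self-contained sketch. It is not TE code: FP8Stub stands in for Float8Tensor, and plain softmax attention stands in for the fused attention kernels; it only illustrates the idea of dispatching on the input type and letting fp8_mha control the output type.

import torch

class FP8Stub(torch.Tensor):
    """Toy stand-in for Float8Tensor; only the type matters for the dispatch."""

def _attention(q, k, v):
    # Placeholder for the fused attention kernel.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def dpa_forward(q, k, v, fp8_mha=False):
    # Decide the FP8 path from the input type, not from the fp8_mha flag.
    if not all(isinstance(t, FP8Stub) for t in (q, k, v)):
        # BF16 inputs (FP8 DPA, or FP8 MHA with RoPE): "cast" them to FP8 here.
        q, k, v = (t.as_subclass(FP8Stub) for t in (q, k, v))
    out = _attention(q, k, v)
    # fp8_mha keeps the output in FP8 so the following Linear gets FP8 input;
    # otherwise the output is "cast" back to BF16 as before.
    return out if fp8_mha else out.as_subclass(torch.Tensor)

q = k = v = torch.randn(2, 4, 8, 16, dtype=torch.bfloat16)
print(type(dpa_forward(q, k, v, fp8_mha=True)).__name__)   # FP8Stub
print(type(dpa_forward(q, k, v, fp8_mha=False)).__name__)  # Tensor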

Timeline

As the profiler timelines below show, this PR greatly reduces the CPU overheads highlighted in the red boxes.

  • Before: [profiler timeline screenshot]
  • After: [profiler timeline screenshot]

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 (Collaborator) left a comment

Overall I like this approach. The current FP8 MHA impl is brittle since it expects the modules to pass specific combinations of Float8Tensor/torch.Tensor. Adding logic so the modules can do casts internally makes this more flexible.

This is similar to how I envision the operation-based API working. See how we cast inputs in the linear operation:

if with_fp8_compute and not is_float8_tensor(x_local):
    fp8_dtype = get_fp8_te_dtype(
        input_fp8_meta["recipe"],
        fprop_tensor=True,
    )
    x_fp8 = Float8Tensor(
        data=torch.empty_like(x_local, dtype=torch.uint8),
        fp8_meta=input_fp8_meta,
        fp8_meta_forward=True,
        fp8_meta_index=0,
        fp8_dtype=fp8_dtype,
        fp8_scale_inv=torch.empty([1], dtype=torch.float32, device=device),
        dtype=dtype,
    )
    with_cast_transpose = weight.requires_grad
    if tensor_parallel_mode == "column" and sequence_parallel:
        with_cast_transpose = False
    if with_cast_transpose:
        x_fp8.cast_transpose_(x_local)
    else:
        x_fp8.copy_(x_local)
    x_local = x_fp8
elif not with_fp8_compute and is_float8_tensor(x_local):
    x_local = x_local.from_float8()

We're not there yet, but the goal is to be able to implement FP8 MHA with something like:

model = te.Sequential(
    te.ops.LayerNorm(...),  # fp8 output
    te.ops.Linear(...),
    te.ops.RoPE(...),  # fp8 output
    te.ops.SelfAttention(...),  # fp8 output
    te.ops.Linear(...),
)
with te.fp8_autocast():
    y = model(x)

Review thread (outdated, resolved) on transformer_engine/pytorch/module/layernorm_linear.py
@timmoon10 (Collaborator) commented

Regarding further optimizations: removing the select operations would be helpful if it's not too difficult. I've observed that they add non-trivial CPU overhead in other cases, so I recommend looking at #865. You should also be aware that I've made significant changes in the cpp_extensions functions in #1083.

The logic for torch.ops.tex_ts is for ONNX export, which is based on TorchScript. We only register the ops needed for inference, which is why the FP8 cast is registered with TorchScript while the FP8 cast-transpose is a plain Pybind11 function.

@yaox12 (Collaborator, Author) commented on Aug 14, 2024

Regarding further optimizations: removing the select operations would be helpful if it's not too difficult. I've observed that they add non-trivial CPU overhead in other cases, so I recommend looking at #865. You should also be aware that I've made significant changes in the cpp_extensions functions in #1083.

The logic for torch.ops.tex_ts is for ONNX export, which is based on TorchScript. We only register the ops needed for inference, which is why the FP8 cast is registered with TorchScript while the FP8 cast-transpose is a plain Pybind11 function.

Thanks for your explanation. cast_to_fp8 is doing index selection in TorchScript and thus calling PyTorch ops. I'll try to move it to plain C++.
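
To illustrate the overhead being discussed (this is not TE code, just a stand-alone demonstration): selecting a single scale out of the fp8_meta buffers with Python/TorchScript indexing dispatches an aten::select op on every call, which is exactly the kind of per-call CPU cost that goes away once the C++ side reads the element directly from the data pointer.

import torch

scale = torch.ones(16)  # stand-in for an fp8_meta scale buffer
meta_index = 3          # stand-in for the FP8 tensor's meta index

with torch.profiler.profile() as prof:
    for _ in range(100):
        _ = scale[meta_index]  # each iteration dispatches an aten::select

# aten::select shows up ~100 times below; in plain C++ the same read is just
# scale.data_ptr<float>()[meta_index], with no dispatcher round trip.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))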

@yaox12 force-pushed the xiny/fp8_mha_with_rope branch 3 times, most recently from 6e1334d to a1ba977, on August 14, 2024 05:34
@yaox12 changed the title from "[PyTorch] FP8 MHA with RoPE" to "[PyTorch] FP8 MHA with RoPE and Miscellaneous Improvements" on Aug 14, 2024
@yaox12 marked this pull request as ready for review on August 14, 2024 05:58
Review threads (outdated, resolved) on:
  • transformer_engine/pytorch/attention.py
  • transformer_engine/pytorch/module/layernorm_linear.py
  • transformer_engine/pytorch/module/linear.py
  • transformer_engine/pytorch/csrc/extensions/attention.cu
@timmoon10 (Collaborator) commented

/te-ci pytorch

@timmoon10 (Collaborator) left a comment

LGTM

Signed-off-by: Xin Yao <[email protected]>
@cyanguwa (Collaborator) left a comment

LGTM

Review thread (outdated, resolved) on transformer_engine/pytorch/attention.py
@cyanguwa (Collaborator) commented on Aug 20, 2024

@yaox12 do we have a test that specifically covers the FP8 MHA + RoPE functionality? That test should also answer your question above regarding FP8GlobalStateManager.get_fp8_recipe().fp8_mha. Thanks.

Signed-off-by: Xin Yao <[email protected]>
@yaox12 (Collaborator, Author) commented on Aug 21, 2024

@yaox12 do we have a test that specifically covers the FP8 MHA + RoPE functionality? That test should also answer your question above regarding FP8GlobalStateManager.get_fp8_recipe().fp8_mha. Thanks.

Thanks. Added RoPE tests.
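
For readers following along, a rough sketch of how FP8 MHA + RoPE could be exercised is below. This is not the test added in this PR; the fp8_dpa/fp8_mha recipe flags, the RotaryPositionEmbedding helper, and the rotary_pos_emb argument are assumptions based on the discussion above, and an FP8-capable attention backend (FA3 or cuDNN fused attention on Hopper) is required.

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
from transformer_engine.pytorch.attention import RotaryPositionEmbedding

seq_len, batch, heads, head_dim = 128, 2, 16, 64
hidden = heads * head_dim

# Assumed module/kwarg names; consult tests/pytorch/fused_attn for the real tests.
mha = te.MultiheadAttention(hidden, heads, attention_dropout=0.0,
                            params_dtype=torch.bfloat16).cuda()
rope = RotaryPositionEmbedding(head_dim)(max_seq_len=seq_len).cuda()
x = torch.randn(seq_len, batch, hidden, dtype=torch.bfloat16, device="cuda")

# fp8_mha=True keeps the DPA output in FP8 for the following Linear; with RoPE,
# the QKV projection now emits BF16 so RoPE can be applied before the cast to FP8.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, fp8_dpa=True, fp8_mha=True)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = mha(x, rotary_pos_emb=rope)
out.sum().backward()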

@timmoon10 self-requested a review on August 21, 2024 18:00
@cyanguwa (Collaborator) commented

/te-ci pytorch

@yaox12 (Collaborator, Author) commented on Aug 27, 2024

@cyanguwa I found that Flash Attention 3 is not installed in our CI container, so I just skip the FP8 DPA/MHA tests when FA3 is not available; otherwise they throw the error "no attention backends available".

Another CI failure is tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_mask[mask_9_0-model_configs0-dtype0], and it fails in other PRs too. It seems to be a CI issue.

@cyanguwa (Collaborator) commented on Aug 27, 2024

@cyanguwa I found that Flash Attention 3 is not installed in our CI container, so I just skip the FP8 DPA/MHA tests when FA3 is not available; otherwise they throw the error "no attention backends available".

Another CI failure is tests/pytorch/fused_attn/test_fused_attn.py::test_dpa_mask[mask_9_0-model_configs0-dtype0], and it fails in other PRs too. It seems to be a CI issue.

Yes, mask_9_0 is a cuDNN 9.4.0.47 issue, and the FP8 tests are getting fixed in #1141.

@yaox12 (Collaborator, Author) commented on Aug 30, 2024

@timmoon10 Can you review the unresolved comments above?

@timmoon10 (Collaborator) left a comment

LGTM

@yaox12 (Collaborator, Author) commented on Sep 2, 2024

@timmoon10 @cyanguwa Can you trigger the CI?

@yaox12 (Collaborator, Author) commented on Sep 4, 2024

/te-ci pytorch

Signed-off-by: Xin Yao <[email protected]>
@yaox12 (Collaborator, Author) commented on Sep 4, 2024

/te-ci pytorch

@yaox12 (Collaborator, Author) commented on Sep 5, 2024

Since Tim and Charlene have approved, all comments have been resolved, and the CI has passed, I'll merge this PR.

@yaox12 merged commit 5fafeb0 into NVIDIA:main on Sep 5, 2024
26 checks passed