
[PyTorch] Implement Fp8 padding and unpadding module #1129

Merged

Conversation

Contributor

@BeingGod BeingGod commented Aug 22, 2024

Description

Currently, the FP8 unpadding backward pass is implemented via torch.autograd. It involves num_gemms * aten::fill plus device-to-device (DtoD) copy calls, which hurts MoE model performance in FP8 training. So we implemented Fp8 padding and unpadding modules to eliminate the autograd overhead. The workflow is shown below.

Workflow:
[workflow diagram]

We see a 2% E2E performance gain in our MoE model.
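A minimal usage sketch of the new modules around a grouped GEMM. The module names follow this PR; the exact constructor/forward signatures and the GroupedLinear usage are assumptions based on this PR and the related #1128, not the authoritative API:

# Illustrative sketch only: pad routed splits before the FP8 grouped GEMM,
# then strip the padded rows again afterwards.
import torch
import transformer_engine.pytorch as te

num_gemms = 4                      # number of experts / grouped GEMMs
m_splits = [1000, 1017, 1023, 56]  # routed token counts per expert (not multiples of 16)
hidden = 1024

inp = torch.randn(sum(m_splits), hidden, dtype=torch.bfloat16, device="cuda")

padding = te.Fp8Padding(num_gemms)
unpadding = te.Fp8Unpadding(num_gemms)
linear = te.GroupedLinear(num_gemms, hidden, hidden, bias=False)  # assumed grouped GEMM module

# Pad each split up to a multiple of 16 so the FP8 GEMMs see aligned row counts.
# (In real FP8 training the GEMM would run under te.fp8_autocast().)
padded_inp, padded_m_splits = padding(inp, m_splits)
out = linear(padded_inp, padded_m_splits)
# Drop the padded rows so downstream code sees the original token counts.
out = unpadding(out, m_splits)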

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

Please list the changes introduced in this PR:

  • Implement a fused multi-tensor padding kernel, multi_padding_kernel (see the unfused reference sketch below).
  • Implement the FP8Padding and FP8Unpadding modules.
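For reference, an unfused pure-PyTorch sketch of what the fused multi-tensor padding computes (the function and variable names here are illustrative, not the kernel's actual interface):

import torch

def multi_padding_reference(inp, m_splits, padded_m_splits):
    # Copy each split of rows into a zero-initialized buffer whose row count is
    # rounded up to a multiple of 16; rows past the original split stay zero.
    # (Whether the fused kernel zero-fills or leaves padded rows undefined is an
    # implementation detail; zeros are used here for clarity.)
    out = torch.zeros(sum(padded_m_splits), inp.shape[1], dtype=inp.dtype, device=inp.device)
    for src, dst in zip(torch.split(inp, m_splits), torch.split(out, padded_m_splits)):
        dst[: src.shape[0]].copy_(src)
    return out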

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Collaborator

Hi,
I see quite a bit of code duplication between this file and the existing cast_transpose.cu. Would it be possible to add a padding option to cast_transpose and call it from this multi_pad_cast_transpose_kernel?

Contributor Author

Hi @phu0ngng,
I don't think cast_transpose needs to implement padding.
Reason: in an MoE model, routing causes the sequence dimension to not be a multiple of 16, so we have to pad it. But for cast_transpose, the sequence dimension is usually already a multiple of 16 (e.g. 2048, 4096, 8192...).
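For illustration (split sizes here are made up), rounding each routed split up to the next multiple of 16 looks like:

m_splits = [1000, 1017, 56]                                # routed token counts per expert
padded_m_splits = [(m + 15) // 16 * 16 for m in m_splits]
print(padded_m_splits)                                     # [1008, 1024, 64]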

Collaborator

@phu0ngng phu0ngng Aug 27, 2024

I mean we can avoid code duplication between cast_transpose and multi_pad_cast_transpose_kernel by templating cast_transpose so that it does padding when needed. Then, from multi_pad_cast_transpose_kernel, one can iterate through the loop and call cast_transpose for each GEMM with padding enabled.

cast_transpose with a padding option could be reused in other future features.

Contributor Author

Ack.

I'm refactoring the code, so multi_pad_cast_transpose_kernel may be removed. But your suggestion is valuable. Thanks a lot.

@BeingGod BeingGod changed the title [PyTorch] Implement of fused padding, cast and transpose kernel [WIP][PyTorch] Implement of fused padding, cast and transpose kernel Aug 27, 2024
@BeingGod BeingGod force-pushed the dev/zhangrb/fused_multi_pad_cast_transpose branch from a6c845d to 63ef882 Compare August 28, 2024 06:55
@BeingGod BeingGod changed the title [WIP][PyTorch] Implement of fused padding, cast and transpose kernel [PyTorch] Implement of Fp8 padding and unpadding module Aug 28, 2024
@BeingGod BeingGod changed the title [PyTorch] Implement of Fp8 padding and unpadding module [PyTorch] Implement Fp8 padding and unpadding module Aug 28, 2024
Collaborator

@yaox12 yaox12 left a comment

Generally LGTM.
I feel we could have better names for Fp8Padding/Fp8Unpadding, such as MultiPadding/MultiUnpadding. cc @phu0ngng

@@ -221,6 +221,14 @@ def cast_if_needed(tensor: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
return tensor if tensor is None or tensor.dtype == dtype else tensor.to(dtype)


def cast_if_needed_by_actual_dtype(
Collaborator

This seems to never be used. Should we remove it?

Contributor Author

Ack, thx.

@@ -659,6 +659,7 @@ def prepare_forward(
is_first_microbatch: Union[bool, None], # pylint: disable=unused-argument
num_gemms: int = 1,
allow_non_contiguous: bool = False,
with_param: bool = True,
Collaborator

This also seems to never be used.

Contributor Author

@BeingGod BeingGod Aug 28, 2024

Ack, thx.

Comment on lines 35 to 42
inputmats = torch.split(inp.view(-1, in_features), m_splits)

# Allocate cast and transpose output tensor
total_row = sum(padded_m_splits)
out = torch.empty([total_row, in_features], dtype=inp.dtype, device="cuda")
out_list = torch.split(out, padded_m_splits)

multi_padding_fused(inputmats, padded_m_splits, out_list)
Collaborator

I think it's better to make multi_padding_fused accept whole tensors as inp/out,

inp = inp.view(-1, in_features)
multi_padding_fused(inp, m_splits, padded_m_splits, out)

and replace torch.split with stepping the pointers in C++ to reduce CPU overheads, as I'm doing in https://github.com/NVIDIA/TransformerEngine/pull/1128/files#diff-342b0e9e5b472b443484b3a2c4a78647cd72431c272ed44c678ecbe636fb7a3aR191-R194
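(For illustration only: the bookkeeping that would move to the C++ side amounts to deriving each GEMM's row offsets from the split sizes instead of building torch.split views in Python; the names and numbers below are made up.)

import itertools

m_splits = [1000, 1017, 56]
padded_m_splits = [(m + 15) // 16 * 16 for m in m_splits]            # [1008, 1024, 64]

# Row offsets of each GEMM's block in the packed input / padded output.
in_row_offsets = [0, *itertools.accumulate(m_splits[:-1])]           # [0, 1000, 2017]
out_row_offsets = [0, *itertools.accumulate(padded_m_splits[:-1])]   # [0, 1008, 2032]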

Contributor Author

Ack. Your work is helpful to me.

Thx.

@BeingGod
Contributor Author

BeingGod commented Aug 28, 2024

Generally LGTM. I feel we could have better names for Fp8Padding/Fp8Unpadding, such as MultiPadding/MultiUnpadding. cc @phu0ngng

Hi @yaox12, thanks for your suggestion. I wonder if MultiPadding/MultiUnpadding should support customized padding numbers (e.g. 8, 32...)?
The current padding and unpadding module only supports padding to 16 (for FP8). So I added the Fp8 prefix for padding and unpadding modules.

@phu0ngng
Collaborator

Hi @yaox12, thanks for your suggestion. I wonder if MultiPadding/MultiUnpadding should support customized padding numbers (e.g. 8, 32...)? The current padding and unpadding module only supports padding to 16 (for FP8). So I added the Fp8 prefix for padding and unpadding modules.

Are there any numbers on E2E performance gain in your MoE model for other non-FP8 types (BF16 for example) with this padding/unpadding?

@BeingGod
Contributor Author

BeingGod commented Aug 28, 2024

Hi @yaox12, thanks for your suggestion. I wonder if MultiPadding/MultiUnpadding should support customized padding numbers (e.g. 8, 32...)? The current padding and unpadding module only supports padding to 16 (for FP8). So I added the Fp8 prefix for padding and unpadding modules.

Are there any numbers on E2E performance gain in your MoE model for other non-FP8 types (BF16 for example) with this padding/unpadding?

I don't have performance data for non-FP8 types right now, but it is a good idea. I will run some benchmarks for non-FP8 types with padding/unpadding.

Update:
Hi @phu0ngng. I have done some benchmarks for BF16 with padding/unpadding, setting the padding multiple to 4, 8, and 16. Using padding/unpadding for BF16 seems to hurt E2E performance, so supporting other padding multiples is probably not worthwhile for now.

[benchmark results image]

@phu0ngng
Collaborator

Hi @phu0ngng. I have done some benchmarks for BF16 with padding/unpadding, setting the padding multiple to 4, 8, and 16. Using padding/unpadding for BF16 seems to hurt E2E performance, so supporting other padding multiples is probably not worthwhile for now.

Agree. I think we can keep FP8 in the name for now. Thanks.

@BeingGod BeingGod force-pushed the dev/zhangrb/fused_multi_pad_cast_transpose branch from e652a43 to a1ae467 Compare August 30, 2024 03:15
 1. Add multi-tensor padding kernel
 2. Add FP8Padding and Fp8Unpadding module
 3. Add padding grouped linear UT case

Signed-off-by: beinggod <[email protected]>
@BeingGod BeingGod force-pushed the dev/zhangrb/fused_multi_pad_cast_transpose branch from a1ae467 to c5fd1b4 Compare August 30, 2024 03:33
Collaborator

@phu0ngng phu0ngng left a comment

LGTM!

@phu0ngng
Collaborator

phu0ngng commented Sep 4, 2024

/te-ci pytorch

@phu0ngng phu0ngng merged commit 215db88 into NVIDIA:main Sep 5, 2024
14 checks passed