[PyTorch] Userbuffers support in operation-based API #1142

timmoon10 · 2024-08-27T22:22:47Z

Description

This PR adds basic support in the linear operation for using Userbuffers to overlap tensor-parallel communication with GEMMs. This is implemented as fused operations:

model = te.ops.Sequential(
    te.ops.BasicLinear(...),
    te.ops.Bias(...),
    te.ops.ReduceScatter(...),
)  # Fused into UserbuffersForwardLinear
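
As a rough illustration of what fusing these three operations could look like, the sketch below scans a list of operations for a linear, bias, reduce-scatter window and swaps it for a single fused op. The `Op` class, the `kind` strings, and `fuse_linear_bias_rs` are hypothetical names for this example only; they are not the Transformer Engine API, and the actual fusion logic in this PR may differ.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """Hypothetical stand-in for an operation in a sequential pipeline."""
    kind: str


# Hypothetical pattern: the three op kinds that get fused together.
PATTERN = ["linear", "bias", "reduce_scatter"]


def fuse_linear_bias_rs(ops):
    """Replace each (linear, bias, reduce_scatter) window with one fused op."""
    fused = []
    i = 0
    while i < len(ops):
        window = [op.kind for op in ops[i:i + len(PATTERN)]]
        if window == PATTERN:
            # One fused op stands in for the whole window.
            fused.append(Op("fused_linear_bias_rs"))
            i += len(PATTERN)
        else:
            # No match at this position: keep the op unchanged.
            fused.append(ops[i])
            i += 1
    return fused


pipeline = [Op("linear"), Op("bias"), Op("reduce_scatter")]
print([op.kind for op in fuse_linear_bias_rs(pipeline)])
```

Matching on a fixed window keeps the unfused path intact: a pipeline that lacks the reduce-scatter, for example, is returned unchanged.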

I've tried to avoid touching the core UB infrastructure in transformer_engine/pytorch/module/base.py, so I've kept its messy API and worked around some bugs with hacks. This feature should be considered experimental and unstable.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Add fused operation for linear forward with Userbuffers
Add fused operation for linear backward with Userbuffers

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

timmoon10 · 2024-08-27T22:41:26Z

/te-ci pytorch

timmoon10 · 2024-09-09T18:40:42Z

/te-ci pytorch

denera

LGTM, at least on the Userbuffers code side.

timmoon10 · 2024-09-26T04:38:47Z

/te-ci pytorch

timmoon10 · 2024-10-09T23:10:45Z

transformer_engine/pytorch/csrc/comm_gemm_overlap.h

The previous UB GEMM+RS implementation has a correctness bug. UB splits the GEMM into multiple chunks so that it can compute one output chunk while running the reduce-scatter (RS) on another. However, each output chunk needs a different chunk of the bias. We previously passed the same bias pointer for every chunk; this PR computes the correct offset into the bias for each chunk.
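
The bias offset issue described above can be shown with a small pure-Python model. This is only a sketch, not the CUDA/C++ code in comm_gemm_overlap.h: it assumes the GEMM is split along the dimension that the bias indexes, and `linear_reference`, `linear_chunked`, and the `offset_bias` flag are hypothetical names for the example.

```python
def linear_reference(x, w, bias):
    """Unchunked y = x @ w.T + bias, using nested lists."""
    return [
        [sum(a * b for a, b in zip(row, w_row)) + bias[j]
         for j, w_row in enumerate(w)]
        for row in x
    ]


def linear_chunked(x, w, bias, num_chunks, offset_bias=True):
    """Same computation, one chunk of output features at a time.

    With offset_bias=False every chunk reads the bias from index 0,
    which models reusing one bias pointer for all chunks.
    """
    out_features = len(w)
    assert out_features % num_chunks == 0
    chunk = out_features // num_chunks
    y = [[0] * out_features for _ in x]
    for c in range(num_chunks):
        start = c * chunk
        # Correct behavior: advance the bias by the same offset as the output.
        bias_start = start if offset_bias else 0
        for r, row in enumerate(x):
            for j in range(chunk):
                acc = sum(a * b for a, b in zip(row, w[start + j]))
                y[r][start + j] = acc + bias[bias_start + j]
    return y


x = [[1, 2], [3, 4]]
w = [[1, 0], [0, 1], [1, 1], [2, -1]]
bias = [10, 20, 30, 40]

expected = linear_reference(x, w, bias)
print(linear_chunked(x, w, bias, num_chunks=2) == expected)
print(linear_chunked(x, w, bias, num_chunks=2, offset_bias=False) == expected)
```

In this model the first chunk is correct either way, since its bias offset is zero; only the later chunks pick up the wrong bias values, which is one reason a bug of this kind can be easy to miss.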

timmoon10 · 2024-10-09T23:19:56Z

transformer_engine/pytorch/tensor/float8_tensor.py

+        data: Optional[torch.Tensor] = None,
         scale: Optional[torch.Tensor] = None,
         amax: Optional[torch.Tensor] = None,
         scale_inv: Optional[torch.Tensor] = None,
         with_transpose_cache: bool = False,
+        data_transpose: Optional[torch.Tensor] = None,


The data kwarg lets us easily initialize Float8Tensors that use the UB workspace buffer as their storage.

The data_transpose kwarg is added for completeness.
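
To illustrate why an optional data argument is convenient here, the sketch below builds a tensor-like object on top of a slice of a preallocated workspace, so writes through the object land directly in the shared buffer. It uses only the Python standard library; `Fp8TensorSketch` is a hypothetical class for this example and is not the `Float8Tensor` constructor.

```python
class Fp8TensorSketch:
    """Hypothetical 1-byte-per-element tensor that can wrap an existing buffer."""

    def __init__(self, shape, data=None):
        numel = 1
        for dim in shape:
            numel *= dim
        if data is None:
            # Default path: allocate fresh storage.
            data = bytearray(numel)
        elif len(data) != numel:
            raise ValueError("data buffer does not match the requested shape")
        self.shape = tuple(shape)
        self.data = data  # Kept by reference, not copied.


# A stand-in for a communication workspace that is allocated once.
workspace = bytearray(16)

# Wrap bytes 4..11 of the workspace without copying.
tensor = Fp8TensorSketch((2, 4), data=memoryview(workspace)[4:12])
tensor.data[0] = 7

print(workspace[4])
```

Because the object aliases the workspace instead of owning its own storage, no extra copy is needed to move results into the buffer used for communication.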

timmoon10 · 2024-10-09T23:21:54Z

/te-ci pytorch

timmoon10 · 2024-10-31T23:05:39Z

/te-ci L1

timmoon10 · 2024-11-01T21:17:53Z

/te-ci pytorch jax paddle L1

timmoon10 · 2024-11-05T21:18:42Z

/te-ci pytorch L1

timmoon10 added 13 commits August 20, 2024 01:09

Add Userbuffers support for column TP linear layer

f2da5eb

Signed-off-by: Tim Moon <[email protected]>

Add Userbuffers support for row TP linear layer

90e0a41

Signed-off-by: Tim Moon <[email protected]>

Interpret linear+RS as row TP linear

a520974

Signed-off-by: Tim Moon <[email protected]>

Add Userbuffers support for FP8 row TP linear layer

bb2e714

Assumes FP8 RS, which is not a good assumption. Signed-off-by: Tim Moon <[email protected]>

Debug bug with incorrect bias pointers in UB GEMM

1e54b88

Bias pointers are not properly offset for different data chunks. Also removed logic for FP8 RS. Signed-off-by: Tim Moon <[email protected]>

Add Userbuffers support for linear dgrad

80b9d42

Test passes with row TP, fails with col TP. Signed-off-by: Tim Moon <[email protected]>

Add Userbuffers support for linear wgrad

e6ad571

Signed-off-by: Tim Moon <[email protected]>

Add support for grad bias

bd5c61e

Signed-off-by: Tim Moon <[email protected]>

Merge branch 'main' into ub-ops

db5a7e2

Fused cast-transpose-dbias

d5f8a8b

Signed-off-by: Tim Moon <[email protected]>

Support case where wgrad is optional

cd0db1c

Signed-off-by: Tim Moon <[email protected]>

Expand documentation

6209910

Signed-off-by: Tim Moon <[email protected]>

Merge branch 'main' into ub-ops

38263fe

timmoon10 requested review from denera and ksivaman August 27, 2024 22:22

pre-commit-ci bot and others added 2 commits August 27, 2024 22:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

a98e2f2

for more information, see https://pre-commit.ci

Fix linter warnings

7aaef65

Signed-off-by: Tim Moon <[email protected]>

Merge branch 'main' into ub-ops

5d7f48a

denera approved these changes Sep 24, 2024

timmoon10 added 3 commits September 26, 2024 03:31

Merge branch 'main' into ub-ops

880486f

Signed-off-by: Tim Moon <[email protected]>

Use recently added convenience functions in Float8Tensor

7d8e08b

Signed-off-by: Tim Moon <[email protected]>

Respect autograd dtype

fd4e541

Signed-off-by: Tim Moon <[email protected]>

timmoon10 force-pushed the ub-ops branch from 3beea2a to fd4e541 Compare September 26, 2024 04:28

pre-commit-ci bot and others added 2 commits September 26, 2024 04:28

[pre-commit.ci] auto fixes from pre-commit.com hooks

a0646d2

for more information, see https://pre-commit.ci

Fix missing imports

d77a9fb

Signed-off-by: Tim Moon <[email protected]>

timmoon10 added 2 commits October 9, 2024 22:40

Merge branch 'main' into ub-ops

706d490

Respect PyT autocast dtype in bprop

1ed735d

Signed-off-by: Tim Moon <[email protected]>

timmoon10 commented Oct 9, 2024

timmoon10 and others added 6 commits October 18, 2024 16:52

Merge branch 'main' into ub-ops

c2709d2

Signed-off-by: Tim Moon <[email protected]>

Fix linter warnings

98a6cf4

Signed-off-by: Tim Moon <[email protected]>

Merge branch 'main' into ub-ops

0242d83

Signed-off-by: Tim Moon <[email protected]>

Debug merge conflicts

a297abb

Signed-off-by: Tim Moon <[email protected]>

Merge branch 'ub-ops2' into ub-ops

12ca945

[pre-commit.ci] auto fixes from pre-commit.com hooks

cbb25d6

for more information, see https://pre-commit.ci

Merge branch 'main' into ub-ops

2fab5c7

Merge branch 'main' into ub-ops

0875f24

Signed-off-by: Tim Moon <[email protected]>

timmoon10 merged commit 095b27d into NVIDIA:main Nov 6, 2024
26 checks passed

timmoon10 mentioned this pull request Nov 9, 2024

[PyTorch] Remove special handling for FP8 params in FP8 recipe infrastructure #1326

Open

13 tasks
