[PyTorch] Debug CUDA graph support with operation-based API #1117

timmoon10 · 2024-08-16T01:53:27Z

Description

This PR debugs CUDA graph support with the operation-based API (see #707). The CUDA graph logic is similar to that of the module-based API.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Changes

Debug CUDA graph support with operation-based API
Refactor CUDA graph tests

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

timmoon10 · 2024-08-16T01:56:21Z

/te-ci pytorch

transformer_engine/pytorch/graph.py

ptrendx · 2024-09-18T22:48:26Z

transformer_engine/pytorch/ops/op.py

+        if fp8_recipe is None:
+            fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
+        if fp8_recipe is None:
+            fp8_recipe = get_default_fp8_recipe()


Hmmmm, this second if looks like logic that should be inside get_fp8_recipe in the FP8GlobalStateManager.

Also, since this is an internal function, couldn't we just always ask for a valid recipe here and deal with getting it in the caller?

timmoon10 · 2024-09-20

This case shouldn't happen in any of our current use-cases (FP8GlobalStateManager.get_fp8_recipe() is set within fp8_autocast, fp8_recipe is provided within make_graphed_callables), but it seems delicate to rely on that assumption.
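The suggestion above, moving the default-recipe fallback into the global getter so callers never receive None, can be sketched roughly as follows. This is a minimal pure-Python illustration, not Transformer Engine code: `Recipe`, `GlobalState`, and `resolve_recipe` are hypothetical stand-ins, and the default history length of 16 is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for an FP8 recipe; the default value is illustrative."""

    amax_history_len: int = 16


class GlobalState:
    """Hypothetical stand-in for a global FP8 state manager."""

    _recipe: Optional[Recipe] = None

    @classmethod
    def set_recipe(cls, recipe: Optional[Recipe]) -> None:
        cls._recipe = recipe

    @classmethod
    def get_recipe(cls) -> Recipe:
        # The default fallback lives in the getter, so callers never see None.
        if cls._recipe is None:
            return Recipe()
        return cls._recipe


def resolve_recipe(recipe: Optional[Recipe] = None) -> Recipe:
    """Prefer an explicitly provided recipe, otherwise ask the global state."""
    if recipe is None:
        recipe = GlobalState.get_recipe()
    return recipe
```

With the fallback in one place, the caller needs only a single `None` check instead of two.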

ptrendx · 2024-09-18T22:51:26Z

transformer_engine/pytorch/ops/op.py

            if curr_len == amax_history_len:
                continue
+
+            # Reallocate amax history


Could this be its own function?

timmoon10 · 2024-09-20

I've tried to keep this logic similar to how it's handled in the modules:

TransformerEngine/transformer_engine/pytorch/module/base.py

Line 410 in 0ee5ccd

def adjust_amax_history_length(self, length: int, fwd: Optional[bool] = None) -> None:

I think it would be nice to consolidate this logic in fp8.py and reuse it for both modules and operations, but that's probably best done in a pure refactor PR.
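As a rough idea of what a shared helper for this could look like, here is a minimal sketch that uses nested Python lists in place of tensors. The name `resize_amax_history` is hypothetical. Zero-padding at the end mirrors the `pad=(0, 0, 0, amax_history_len - curr_len)` call in the diff; the truncation branch is an assumption, since it is not shown in this thread.

```python
from typing import List

History = List[List[float]]  # rows are history steps, columns are scaling factors


def resize_amax_history(history: History, new_len: int) -> History:
    """Return an amax history with exactly new_len rows.

    Hypothetical helper: pads with zero rows at the end when growing and
    keeps the first new_len rows when shrinking (the latter is an assumption).
    """
    curr_len = len(history)
    if curr_len == new_len:
        # Nothing to reallocate, return the same object.
        return history
    if curr_len > new_len:
        return [row[:] for row in history[:new_len]]
    num_scales = len(history[0]) if history else 0
    padding = [[0.0] * num_scales for _ in range(new_len - curr_len)]
    return [row[:] for row in history] + padding
```

Both modules and operations could call such a helper, which is the consolidation proposed for a later refactor PR.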

ptrendx · 2024-09-18T22:52:00Z

transformer_engine/pytorch/ops/op.py

@@ -260,6 +275,21 @@ def _maybe_update_fp8_meta(cls, fp8_meta: Optional[dict[str, Any]]) -> None:
                        pad=(0, 0, 0, amax_history_len - curr_len),
                    )

+            # Update global buffers for amax reductions


This does not look like a graph-specific thing. Was the lack of this in the previous code a bug?

timmoon10 · 2024-09-20

Yep, if the amax history length changes then I don't expect amax reductions to be handled correctly.
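One way to picture the bug: reallocating the history creates a new buffer object, so any global registry that still holds the old one no longer sees updates. The sketch below uses plain Python lists and a hypothetical `global_amax_buffers` dict; the actual bookkeeping in Transformer Engine is not shown in this thread and may differ.

```python
from typing import Dict, List

History = List[List[float]]

# Hypothetical global registry: buffer key -> history used for amax reductions.
global_amax_buffers: Dict[str, History] = {}


def grow_history(meta: dict, key: str, new_len: int, update_global: bool) -> None:
    """Reallocate meta["amax_history"] to new_len rows, optionally re-registering it."""
    old = meta["amax_history"]
    num_scales = len(old[0])
    pad_rows = [[0.0] * num_scales for _ in range(new_len - len(old))]
    meta["amax_history"] = [row[:] for row in old] + pad_rows  # a new object
    if update_global:
        global_amax_buffers[key] = meta["amax_history"]


def is_stale(meta: dict, key: str) -> bool:
    """True if the registry points at a buffer the meta no longer uses."""
    return global_amax_buffers[key] is not meta["amax_history"]


meta = {"amax_history": [[0.5], [0.25]]}
global_amax_buffers["fwd"] = meta["amax_history"]

grow_history(meta, "fwd", 4, update_global=False)
stale_without_update = is_stale(meta, "fwd")  # registry still holds the old buffer

grow_history(meta, "fwd", 8, update_global=True)
stale_with_update = is_stale(meta, "fwd")  # registry was refreshed
```

Nothing in this failure mode involves graph capture, which matches the observation that the missing update was a bug in its own right.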

timmoon10 · 2024-09-20T03:16:47Z

/te-ci pytorch

timmoon10 · 2024-09-24T19:25:29Z

/te-ci pytorch

timmoon10 · 2024-10-02T01:47:48Z

/te-ci pytorch

timmoon10 · 2024-10-09T17:41:27Z

/te-ci pytorch

timmoon10 · 2024-11-05T00:56:06Z

/te-ci pytorch

timmoon10 · 2024-11-05T17:27:55Z

Merging with approval from @ptrendx and @ksivaman.

@ptrendx

* Debug CUDA graph support with operation-based API
  Signed-off-by: Tim Moon <[email protected]>
* Refactoring CUDA graph tests
  Signed-off-by: Tim Moon <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Review suggestions from @ptrendx
  Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed. Expand error message when failing to load FP8 state after capturing CUDA graph.
  Signed-off-by: Tim Moon <[email protected]>
* Avoid unnecessary recursion when saving/loading FP8 state
  Signed-off-by: Tim Moon <[email protected]>
* Fix circular import
  Signed-off-by: Tim Moon <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

timmoon10 added 2 commits August 15, 2024 17:22

Debug CUDA graph support with operation-based API

d771ca5

Signed-off-by: Tim Moon <[email protected]>

Refactoring CUDA graph tests

ade0c02

Signed-off-by: Tim Moon <[email protected]>

timmoon10 added the bug Something isn't working label Aug 16, 2024

timmoon10 requested a review from ksivaman August 16, 2024 01:53

[pre-commit.ci] auto fixes from pre-commit.com hooks

e5d40a6

for more information, see https://pre-commit.ci

timmoon10 marked this pull request as ready for review August 16, 2024 01:56

timmoon10 mentioned this pull request Sep 10, 2024

[WIP] [PyTorch] Proof-of-concept for using operation-based API in modules #1173

Draft

13 tasks

ptrendx reviewed Sep 18, 2024

transformer_engine/pytorch/graph.py (outdated, resolved)

ptrendx reviewed Sep 18, 2024

timmoon10 added 3 commits September 19, 2024 19:35

Merge branch 'main' into cuda-graph-ops

b1972cf

Review suggestions from @ptrendx

7d04de5

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed. Expand error message when failing to load FP8 state after capturing CUDA graph. Signed-off-by: Tim Moon <[email protected]>

Avoid unnecessary recursion when saving/loading FP8 state

805abc1

Signed-off-by: Tim Moon <[email protected]>

timmoon10 requested a review from ptrendx September 20, 2024 03:16

timmoon10 and others added 3 commits September 24, 2024 12:07

Merge branch 'main' into cuda-graph-ops

69f66d0

Fix circular import

11e6b45

Signed-off-by: Tim Moon <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

f612e10

for more information, see https://pre-commit.ci

Merge branch 'main' into cuda-graph-ops

bdaa5ed

Merge branch 'main' into cuda-graph-ops

3bf06b3

timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Oct 9, 2024

Rebase NVIDIA#1117

ca17ac2

Signed-off-by: Tim Moon <[email protected]>

timmoon10 and others added 2 commits October 18, 2024 16:32

Merge branch 'main' into cuda-graph-ops

04587ac

Merge branch 'main' into cuda-graph-ops

2bd7911

timmoon10 merged commit 50b22da into NVIDIA:main Nov 5, 2024
26 checks passed
