[PyTorch] Debug CUDA graph support with operation-based API #1117

Merged (13 commits) on Nov 5, 2024

timmoon10
Collaborator

Description

This PR debugs CUDA graph support with the operation-based API (see #707). The CUDA graph logic is similar to the module-based API.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • Debug CUDA graph support with operation-based API
  • Refactor CUDA graph tests

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

timmoon10 added the "bug" label (Something isn't working) on Aug 16, 2024
@timmoon10
Collaborator Author

/te-ci pytorch

if fp8_recipe is None:
    fp8_recipe = FP8GlobalStateManager.get_fp8_recipe()
if fp8_recipe is None:
    fp8_recipe = get_default_fp8_recipe()
Member

Hmmmm, this second if looks like logic that should be inside get_fp8_recipe in the FP8GlobalStateManager.

Member

Also, since this is an internal function, couldn't we just always ask for a valid recipe here and deal with getting it in the caller?

Collaborator Author

This case shouldn't happen in any of our current use-cases (FP8GlobalStateManager.get_fp8_recipe() is set within fp8_autocast, fp8_recipe is provided within make_graphed_callables), but it seems delicate to rely on that assumption.
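
For readers following this thread, here is a minimal sketch of the reviewer's suggestion: move the default-recipe fallback into the getter so callers only need one check. Every name in it (Recipe, GlobalState, get_default_recipe, resolve_recipe) is a hypothetical stand-in, not the Transformer Engine API, and the real FP8GlobalStateManager may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recipe:
    """Hypothetical stand-in for an FP8 recipe object."""

    name: str


def get_default_recipe() -> Recipe:
    """Hypothetical stand-in for get_default_fp8_recipe()."""
    return Recipe("default")


class GlobalState:
    """Hypothetical stand-in for FP8GlobalStateManager."""

    _recipe: Optional[Recipe] = None

    @classmethod
    def set_recipe(cls, recipe: Optional[Recipe]) -> None:
        # In the real code this would happen inside an autocast context.
        cls._recipe = recipe

    @classmethod
    def get_recipe(cls) -> Recipe:
        # The fallback lives in the getter, so callers never receive None.
        if cls._recipe is None:
            return get_default_recipe()
        return cls._recipe


def resolve_recipe(recipe: Optional[Recipe] = None) -> Recipe:
    """Caller-side logic: an explicit recipe wins, else ask the global state."""
    if recipe is None:
        recipe = GlobalState.get_recipe()
    return recipe
```

With this layout the second `is None` check from the snippet under review disappears from the caller, while the defensive behavior the author wanted is preserved.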

if curr_len == amax_history_len:
    continue

# Reallocate amax history
Member

Could this be its own function?

Collaborator Author

I've tried to keep this logic similar to how it's handled in the modules:

def adjust_amax_history_length(self, length: int, fwd: Optional[bool] = None) -> None:

I think it would be nice to consolidate this logic in fp8.py and reuse it for both modules and operations, but that's probably best done in a pure refactor PR.
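
As a rough illustration of what such a consolidated helper might do, the sketch below resizes an amax history to a target length. It uses plain nested lists in place of a 2D tensor of shape (history_len, num_scales) so that it runs without PyTorch. The function name resize_amax_history is hypothetical. The grow branch mirrors the effect of torch.nn.functional.pad with pad=(0, 0, 0, n) on a 2D tensor, which appends n zero rows; the shrink branch (keep the first rows) is an assumption, since only the padding call appears in this PR's diff context.

```python
from typing import List

Row = List[float]


def resize_amax_history(history: List[Row], new_len: int) -> List[Row]:
    """Return a copy of `history` with exactly `new_len` rows.

    Hypothetical helper: rows are history entries, columns are scales.
    """
    if new_len < 0:
        raise ValueError("new_len must be non-negative")
    curr_len = len(history)
    if new_len <= curr_len:
        # Shrink or no-op: keep the first `new_len` rows (assumed behavior).
        return [row[:] for row in history[:new_len]]
    # Grow: append zero rows, like F.pad(x, pad=(0, 0, 0, new_len - curr_len)).
    width = len(history[0]) if history else 0
    zero_rows = [[0.0] * width for _ in range(new_len - curr_len)]
    return [row[:] for row in history] + zero_rows
```

Returning a new object on every call means any external buffer that referenced the old history must be pointed at the new one afterwards.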

@@ -260,6 +275,21 @@ def _maybe_update_fp8_meta(cls, fp8_meta: Optional[dict[str, Any]]) -> None:
pad=(0, 0, 0, amax_history_len - curr_len),
)

# Update global buffers for amax reductions
Member

@ptrendx ptrendx Sep 18, 2024

This does not look like a graph-specific thing. Was the lack of this in the previous code a bug?

Collaborator Author

Yep, if the amax history length changes then I don't expect amax reductions to be handled correctly.

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed. Expand error message when failing to load FP8 state after capturing CUDA graph.

Signed-off-by: Tim Moon <[email protected]>
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

/te-ci pytorch

timmoon10 added a commit to timmoon10/TransformerEngine that referenced this pull request Oct 9, 2024
Signed-off-by: Tim Moon <[email protected]>
@timmoon10
Collaborator Author

/te-ci pytorch

@timmoon10
Collaborator Author

Merging with approval from @ptrendx and @ksivaman.

@timmoon10 timmoon10 merged commit 50b22da into NVIDIA:main Nov 5, 2024
26 checks passed
phu0ngng pushed a commit to phu0ngng/TransformerEngine that referenced this pull request Nov 5, 2024

* Debug CUDA graph support with operation-based API

Signed-off-by: Tim Moon <[email protected]>

* Refactoring CUDA graph tests

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Review suggestions from @ptrendx

Return default recipe from FP8GlobalStateManager.get_fp8_recipe if needed. Expand error message when failing to load FP8 state after capturing CUDA graph.

Signed-off-by: Tim Moon <[email protected]>

* Avoid unnecessary recursion when saving/loading FP8 state

Signed-off-by: Tim Moon <[email protected]>

* Fix circular import

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>