[PyTorch] Debug checkpointing with operation-based API #1063
Conversation
Signed-off-by: Tim Moon <[email protected]>
/te-ci pytorch
Signed-off-by: Tim Moon <[email protected]>
/te-ci pytorch
Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
/te-ci pytorch
Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
These changes in the Linear op are orthogonal to the rest of this PR. PyTorch modules save checkpoints recursively, so checkpointing a fused op (e.g. Linear) will also checkpoint its constituent basic ops (e.g. BasicLinear, Bias). By registering the weight and bias params with the Linear op, the checkpoints were saving two copies of the params. Converting weight and bias into Python properties avoids this behavior while retaining the existing API.
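For illustration, here is a minimal sketch of that pattern; the class and attribute names are simplified stand-ins, not the actual Transformer Engine ops. Because only the basic ops register the parameters, the state dict contains a single copy of each, while the property accessors preserve the existing weight/bias API:

```python
import torch


class BasicLinearOp(torch.nn.Module):
    """Basic op that owns the weight parameter."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(out_features, in_features))


class BiasOp(torch.nn.Module):
    """Basic op that owns the bias parameter."""

    def __init__(self, out_features):
        super().__init__()
        self.bias = torch.nn.Parameter(torch.empty(out_features))


class FusedLinearOp(torch.nn.Module):
    """Fused op composed of basic ops.

    The weight and bias are exposed as read-only properties that delegate
    to the basic ops, so each parameter is registered (and checkpointed)
    only once, by the basic op that owns it.
    """

    def __init__(self, in_features, out_features):
        super().__init__()
        self.basic_linear = BasicLinearOp(in_features, out_features)
        self.bias_op = BiasOp(out_features)

    @property
    def weight(self):
        return self.basic_linear.weight

    @property
    def bias(self):
        return self.bias_op.bias


op = FusedLinearOp(4, 8)
# Only the basic ops' parameters appear in the state dict:
# ['basic_linear.weight', 'bias_op.bias']
print(list(op.state_dict().keys()))
```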
/te-ci pytorch
* Debug checkpointing with operation-based API
  Signed-off-by: Tim Moon <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Store checkpoint FP8 state on CPU
  Signed-off-by: Tim Moon <[email protected]>
* Fix bug where linear op was saving params multiple times
  Signed-off-by: Tim Moon <[email protected]>
* Fix linter warnings
  Signed-off-by: Tim Moon <[email protected]>
---------
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Description
This PR debugs checkpointing with the operation-based API (see #707), in particular by adding logic to include FP8 scaling factors in the checkpoint. The checkpointing logic is very similar to the module-based API:
TransformerEngine/transformer_engine/pytorch/module/base.py, line 555 (commit 5b6546c)
TransformerEngine/transformer_engine/pytorch/module/base.py, line 587 (commit 5b6546c)
This logic is admittedly rather unintuitive, but I've added comments to justify the weird behavior.
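As a rough illustration of the general mechanism, the sketch below uses PyTorch's get_extra_state / set_extra_state hooks to carry FP8 scaling metadata in the state dict. The field names (fp8_meta, scale, amax_history) are hypothetical stand-ins, and the real Transformer Engine logic in base.py and in the operation-based API differs in its details; this only shows the shape of the approach, including keeping the stored FP8 state on CPU.

```python
import torch


class FP8ModuleSketch(torch.nn.Module):
    """Toy module carrying FP8 scaling metadata (hypothetical field names)."""

    def __init__(self):
        super().__init__()
        # Hypothetical FP8 metadata: a per-tensor scale and an amax history.
        self.fp8_meta = {
            "scale": torch.ones(1),
            "amax_history": torch.zeros(16),
        }

    def get_extra_state(self):
        # Called by state_dict(); the returned object is stored under the
        # module's "_extra_state" key. Move tensors to CPU so the checkpoint
        # is not tied to a particular device.
        return {k: v.detach().cpu() for k, v in self.fp8_meta.items()}

    def set_extra_state(self, state):
        # Called by load_state_dict(); restore the FP8 metadata.
        if state is None:
            return
        self.fp8_meta = {k: v.clone() for k, v in state.items()}


# Round trip: the FP8 metadata travels with the regular checkpoint.
mod = FP8ModuleSketch()
sd = mod.state_dict()          # includes the "_extra_state" entry
restored = FP8ModuleSketch()
restored.load_state_dict(sd)   # set_extra_state repopulates fp8_meta
```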
I've also fixed an orthogonal bug where the linear op was including two copies of its params in its checkpoint.
Type of change
Changes
Checklist: