
[PyTorch] Minor optimizations to reduce CPU overheads in modules #1191

Merged: 20 commits into NVIDIA:main on Oct 4, 2024

Conversation


@timmoon10 timmoon10 commented Sep 18, 2024

Description

We have observed that TE modules experience non-trivial CPU overhead, which often becomes a performance bottleneck in the forward pass of small models. For example, we measured the CPU runtime for Megatron-core modules with BF16 compute and TP=1; the timings below refer to this configuration.

Unfortunately, this overhead is distributed throughout the forward pass. Many basic PyTorch operations, e.g. getting attributes from a torch.Tensor, each incur on the order of 1 us of overhead, so even the basic checks needed to support all of our advanced features eventually add up to something non-trivial.

This PR makes a few minor optimizations:

  • Avoid importing from te.pytorch.cpu_offload in every forward pass
  • Memoize NCCL process group properties
  • Avoid custom logic in torch.nn.Module.__setattr__ when possible
  • Avoid custom logic for accessing params in torch.nn.Module when possible
  • Avoid accessing tensor attrs more than necessary

I see a 1.22x speedup, with 115 us of CPU time per forward pass.
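
To illustrate the memoization item above: NCCL process group properties such as world size and rank are fixed once the group is created, so they can be computed once and cached instead of being queried from torch.distributed on every forward pass. A minimal sketch of the idea (not the actual TE implementation; the helper name get_cached_world_size is hypothetical):

    import functools
    from typing import Optional

    import torch.distributed as dist


    @functools.lru_cache(maxsize=None)
    def get_cached_world_size(group: Optional[dist.ProcessGroup] = None) -> int:
        """World size of a process group, cached per group object.

        The size of a process group never changes after construction, so the
        torch.distributed query only needs to run once per group.
        """
        if not dist.is_initialized():
            return 1
        return dist.get_world_size(group=group)

Since functools.lru_cache keys on its arguments, each distinct process group object gets its own cached entry.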

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactor

Changes

  • Avoid importing from te.pytorch.cpu_offload in every forward pass
  • Memoize NCCL process group properties
  • Avoid custom logic in torch.nn.Module.__setattr__ when possible
  • Avoid custom logic for accessing params in torch.nn.Module when possible
  • Avoid accessing tensor attrs more than necessary

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Avoid enable_grad context when possible in cast function. Cache distributed group properties.

Signed-off-by: Tim Moon <[email protected]>
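
As a sketch of the enable_grad change mentioned in the commit above (the actual TE cast function differs; this only illustrates the idea): enter torch.enable_grad() only when the input actually requires grad, and skip the context manager entirely otherwise.

    import contextlib

    import torch


    def cast_if_needed(tensor: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
        """Cast a tensor to dtype, avoiding the enable_grad context when possible."""
        if tensor.dtype == dtype:
            return tensor  # nothing to cast, no context manager needed
        # Only pay for torch.enable_grad() when autograd needs to record the cast.
        ctx = torch.enable_grad() if tensor.requires_grad else contextlib.nullcontext()
        with ctx:
            return tensor.to(dtype=dtype)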
Avoid torch.nn.Module impl of __setattr__.

Signed-off-by: Tim Moon <[email protected]>
@timmoon10 added the enhancement (New feature or request) label on Sep 18, 2024
@yaox12 (Collaborator) left a comment:


Can you propagate the CPU offloading importing fix to GroupedLinear as well?

from ..cpu_offload import CPUOffloadEnabled
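
For reference, the importing fix being requested here amounts to binding CPUOffloadEnabled once at module load time rather than executing the import statement inside every forward call. A stand-in microbenchmark (not TE code; os.path merely stands in for the cpu_offload import so the snippet is self-contained) shows the difference:

    import timeit


    def forward_with_local_import():
        # Old pattern: the import statement runs on every call, paying for the
        # sys.modules lookup and name binding each time.
        from os import path as cpu_offload_flag
        return cpu_offload_flag


    from os import path as cpu_offload_flag  # bound once at module load


    def forward_with_module_level_import():
        # New pattern: the name is read from module globals; no import machinery.
        return cpu_offload_flag


    print("per-call import:    ", timeit.timeit(forward_with_local_import, number=100_000))
    print("module-level import:", timeit.timeit(forward_with_module_level_import, number=100_000))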

self.fp8_parameters = FP8GlobalStateManager.with_fp8_parameters()
self.fp8 = FP8GlobalStateManager.is_fp8_enabled()
self.fp8_calibration = FP8GlobalStateManager.is_fp8_calibration()
self._fast_setattr("fp8_parameters", FP8GlobalStateManager.with_fp8_parameters())
Member


I wonder if we couldn't instead just do something like

te_params = self.get_te_params()  # calls _fast_getattr internally, te_params is a normal object
te_params.fp8_parameters = FP8GlobalStateManager.with_fp8_parameters()
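
A minimal sketch of what such a "normal object" could look like (hypothetical _TEParams and SketchModule names, assuming get_te_params simply hands back a plain container held by the module): attribute writes then land on an ordinary Python object rather than going through torch.nn.Module.__setattr__.

    import torch


    class _TEParams:
        """Plain container for frequently updated per-forward state (hypothetical)."""

        __slots__ = ("fp8", "fp8_calibration", "fp8_parameters")

        def __init__(self) -> None:
            self.fp8 = False
            self.fp8_calibration = False
            self.fp8_parameters = False


    class SketchModule(torch.nn.Module):
        def __init__(self) -> None:
            super().__init__()
            self._te_params = _TEParams()  # stored once; later writes skip Module bookkeeping

        def get_te_params(self) -> _TEParams:
            return self._te_params

        def forward(self, inp: torch.Tensor) -> torch.Tensor:
            te_params = self.get_te_params()
            te_params.fp8 = False  # plain attribute store on a normal object
            return inp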

Member


Otherwise it will be hard to enforce that everybody uses only _fast_get/setattr, I think.

Collaborator Author

@timmoon10 commented Sep 20, 2024


Even better, we could store these attrs in fp8_meta or some other dict. I feel like the behavior of torch.nn.Module is a hint that we shouldn't change its attrs frequently.

Collaborator Author


I tried wrapping these attrs in a property that internally calls _fast_setattr, but that added ~10 us overhead (probably from the extra indirection when getting the attrs). I think it's a good idea to refactor these frequently changed attrs so they are not held directly by the module, but I think that would be beyond the scope of this PR.

Collaborator Author


Moving this logic into the __setattr__ function makes things a little cleaner. It adds ~2 us overhead, but it's still a win of ~6 us compared to the baseline.
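
A minimal sketch of that approach (attribute names taken from the snippets above; the PR's actual fast path may differ): keep a set of known, frequently updated, plain-Python attribute names and write those straight into the instance dict, falling back to torch.nn.Module.__setattr__ for everything else.

    import torch


    class FastSetattrModule(torch.nn.Module):
        # Attributes that never hold Parameters, buffers, or submodules, so they
        # can safely skip torch.nn.Module's bookkeeping in __setattr__.
        _fast_setattr_names = {"fp8", "fp8_calibration", "fp8_parameters"}

        def __setattr__(self, name: str, value) -> None:
            if name in self._fast_setattr_names:
                self.__dict__[name] = value  # direct instance-dict store
            else:
                super().__setattr__(name, value)  # full Module logic for params, buffers, submodules

The fallback keeps torch.nn.Module's normal handling for every other attribute name.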

@timmoon10 (Collaborator Author): /te-ci pytorch

Signed-off-by: Tim Moon <[email protected]>

@timmoon10 (Collaborator Author): /te-ci pytorch

@timmoon10 (Collaborator Author): /te-ci pytorch

@timmoon10 (Collaborator Author): /te-ci pytorch

@timmoon10 (Collaborator Author): /te-ci pytorch

Comment on lines +221 to +224:

    if tensor is None:
        return None
    if tensor.dtype == dtype:
        return tensor

Member

Suggested change: collapse the two early returns into one (returning tensor when it is None is equivalent to returning None):

    if tensor is None or tensor.dtype == dtype:
        return tensor

@timmoon10 merged commit 9d976bc into NVIDIA:main on Oct 4, 2024
14 of 15 checks passed
Labels: enhancement (New feature or request)

4 participants