[Pytorch] NVIDIA-DL-Framework-Inspect support – part 2 – features #1613


Merged
merged 20 commits into from
May 8, 2025

Conversation

pggPL
Collaborator

@pggPL pggPL commented Mar 25, 2025

Description

The docs for the NVIDIA-DL-Framework-Inspect support. Needs to be merged after part 1.

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman ksivaman self-requested a review April 21, 2025 17:05
#
# See LICENSE for license information.

"""FakeQuant Feature support for nvidia-dlframework-inspect"""
Member

Notes from review: make this more recipe-centric, as opposed to format-centric.

Collaborator Author

I think that quant_format will be sufficient for all the recipes we plan to add, so I haven't changed that. If something more complex appears, I can add a new feature.

"""API call responsible for choice between high-precision and FP8 GEMM execution."""

for key in config:
    if key != "gemm":
Member

I don't think I understand this:

  • in the docstring you say that the option to this feature is "gemms", but here you look for "gemm". Did you intend this to be just gemm (the variable)?
  • I don't see how the option provided by the user has any impact on this feature's behavior.

Collaborator Author

It is parsed by api.py from gemms/gemms_struct into gemm.

Nvidia-DL-Framework-Inspect handles the enabled keyword itself, not the features.
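For illustration, the expansion described above could look roughly like this sketch; the function name expand_gemms and the config shape are hypothetical, not TE's actual api.py code.

```python
# Hypothetical sketch of expanding a feature config that lists several
# GEMMs under "gemms" into per-GEMM configs carrying a single "gemm"
# key, mirroring what api.py is described as doing above.

def expand_gemms(feature_config):
    """Return one config dict per GEMM, each with a single "gemm" key."""
    gemms = feature_config.get("gemms", [])
    common = {k: v for k, v in feature_config.items() if k != "gemms"}
    return [{**common, "gemm": g} for g in gemms]

configs = expand_gemms({"gemms": ["fprop", "dgrad"], "quant_format": "FP8E4M3"})
# each entry now holds exactly one GEMM name under the "gemm" key
```

This is why the feature code above can look for a single "gemm" key even though the user writes "gemms" in the YAML.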

    lambda buffers: sum(_get(buffers, "underflows_num")),
),
"saturations_num": (
    lambda x: (x == 126).sum(),
Member

Where does 126 come from?

Collaborator Author

It's the highest finite encoded value for FP8 (E4M3): the encodings 127 and 255 represent NaN. I added a constant to make this clear.
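To sketch what the `(x == 126).sum()` check above does: in FP8 E4M3, the byte 126 (0x7E) is the largest finite positive encoding, so comparing raw encodings against it counts saturated elements. The helper below is illustrative only, operating on plain Python lists rather than tensors.

```python
# Counting FP8 E4M3 saturations in a sequence of raw byte encodings.
# 126 (0x7E) is the largest finite positive encoding; 127 (0x7F) and
# 255 (0xFF) encode NaN. Illustrative sketch, not TE's implementation.

E4M3_MAX_POS_ENCODING = 126

def count_saturations(raw_bytes):
    """Count elements that hit the maximum finite positive FP8 value."""
    return sum(1 for b in raw_bytes if b == E4M3_MAX_POS_ENCODING)

print(count_saturations([0, 3, 126, 70, 126, 127]))  # two saturated elements
```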

Collaborator Author

I temporarily removed this feature: nobody used it, and Anmol and Shreyas said they added it only to have some fp8_stats features. I will re-add it in another PR.

pggPL and others added 14 commits April 25, 2025 15:32
Co-authored-by: Przemyslaw Tredak <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@lengerfulluse

@pggPL thanks for continuing to work on this great feature!

Just curious, do you have a rough ETA for getting it merged? I'm deciding whether to wait for the merge or pull your PR locally to do some tests. One other question: I saw you removed the saturations part; may I know the reason?

ksivaman previously approved these changes May 7, 2025
@ptrendx
Member

ptrendx commented May 7, 2025

@lengerfulluse we intend to merge this PR ~tomorrow.

@pggPL
Collaborator Author

pggPL commented May 8, 2025

I removed the saturation part because it was incorrect: it's more complex than I thought, and I will re-add it in one of the next PRs.

Member

@ptrendx ptrendx left a comment

LGTM

@ptrendx
Member

ptrendx commented May 8, 2025

/te-ci pytorch

@ptrendx ptrendx merged commit b33dd08 into NVIDIA:main May 8, 2025
10 of 12 checks passed
@lengerfulluse

lengerfulluse commented May 16, 2025

I removed the saturation part because it was incorrect: it's more complex than I thought, and I will re-add it in one of the next PRs.

@pggPL I am conducting FP8 underflow-statistics experiments based on your recently merged NVIDIA-DL-Framework-Inspect support (part 1 + part 2). When I enable the debug_api, I can see it logs messages into the nvdlfw_inspect_logs dir. But when I try to enable LogFp8TensorStats with the following config:

fp8_tensor_stat_collection:
    enabled: True
    layers:
        layer_name_regex_pattern: .* # Regex pattern selecting all layers
        #layer_types: [layernorm_linear, layernorm_mlp]
    transformer_engine:
        LogFp8TensorStats:
            enabled: True
            tensors: [gradient, activation, weight]
            stats: [underflows%]
            freq: 1
            start_step: 1
            end_step: 10

I got an "Unexpected type for quantizer" exception at the following lines:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/layernorm_linear.py#L203C1-L204C1

https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/common.cpp#L35

It looks like it cannot recognize DebugQuantizer; detailed error:

[rank5]:   File "/workspace/TransformerEngine-2.2-inspector/transformer_engine/pytorch/module/layernorm_linear.py", line 185, in forward
[rank5]:     ln_out, mu, rsigma = apply_normalization(
[rank5]:   File "/workspace/TransformerEngine-2.2-inspector/transformer_engine/pytorch/module/_common.py", line 85, in apply_normalization
[rank5]:     return normalization_func(
[rank5]: RuntimeError: /TransformerEngine/transformer_engine/pytorch/csrc/common.cpp:34 in function convert_quantizer: Unexpected type for quantizer

I use the 25.04 container (CUDA 12.9 + rebuilt PyTorch 2.6 + TE 2.2 with the inspector PRs #1613 and #1614 patched in). Is there any misconfiguration or something missing? I'd appreciate your insights.
Thanks again!
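(As an aside on the config above: the layer_name_regex_pattern field selects layers by regular expression. The sketch below illustrates that selection; the helper name select_layers and the layer names are made up, and this is not TE's selection code.)

```python
import re

# Sketch of the layer selection implied by layer_name_regex_pattern in
# the config above: stats are collected only for layers whose names
# match the pattern. Hypothetical helper, not TE's implementation.

def select_layers(layer_names, pattern):
    """Return the layer names fully matched by the regex pattern."""
    rx = re.compile(pattern)
    return [name for name in layer_names if rx.fullmatch(name)]

# ".*" from the config selects every layer; a narrower pattern filters:
print(select_layers(["decoder.0.mlp", "decoder.0.attn", "embedding"],
                    r"decoder\..*"))
```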

@pggPL
Collaborator Author

pggPL commented May 19, 2025

Hi @lengerfulluse, thank you for the feedback. We haven't merged the tests yet and it seems that something stopped working. I hope that the PR with the tests will be merged today and this will solve your problem.

@lengerfulluse

Hi @lengerfulluse, thank you for the feedback. We haven't merged the tests yet and it seems that something stopped working. I hope that the PR with the tests will be merged today and this will solve your problem.

There is another small import bug in
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/features/api.py#L15

from transformer_engine.pytorch.tensor import all_tensor_types

While the actual definition is

def get_all_tensor_types():

https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/tensor/__init__.py#L47

Maybe you are already aware of it; mentioning it just in case.
