[Pytorch] NVIDIA-DL-Framework-Inspect support – part 3 – tests #1612

pggPL · 2025-03-25T14:07:14Z

Description

Needs to be merged after the parts 1-3.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes


ptrendx · 2025-05-09T17:18:23Z

/te-ci pytorch L1


ksivaman · 2025-05-14T04:24:11Z

/te-ci pytorch L0 L1


pggPL · 2025-05-14T14:39:15Z

@ksivaman these tests are not included in te-ci, so they need to be run manually. I started the pipeline now.


* tests drop
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* move dir
  Signed-off-by: Pawel Gadzinski <[email protected]>
* tests fox
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix
  Signed-off-by: Pawel Gadzinski <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Przemek Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemek Tredak <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>

lengerfulluse · 2025-05-27T21:38:30Z

@pggPL good afternoon. I applied some of the bug fixes from part 3, but I still ran into some issues, which may point to further bugs:

1. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/features/utils/stats_computation.py#L128: should underflows_num also be changed to underflows?
2. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/pytorch/debug_quantization.py#L322: on this line I get the error AttributeError: 'DebugQuantizedTensor' object has no attribute 'to', which seems to indicate that the logic above fails when the quantized tensor is a DebugQuantizedTensor.

Looking forward to your response.


pggPL · 2025-06-09T15:08:53Z

@lengerfulluse thank you for the feedback.

1. Hmm, it looks like the name is correct. I do see an error that will break MXFP8 stats collection; it is one of the errors fixed in "[PyTorch Debug] Fixed the empty tensor bug in statistics computation" #1843.
2. This definitely should not happen, and our tests should catch it. I will run all the standard sanity tests we use for standard layers with debug layers as well. I hope it will be fixed soon.
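To illustrate the kind of empty-tensor problem in statistics computation mentioned above, here is a minimal sketch in plain Python. It does not reproduce the code in stats_computation.py or the fix in #1843. The names fake_quantize and underflow_stats are hypothetical, and the sketch assumes that an "underflow" means a nonzero value that becomes zero after quantization. It only shows why a percentage-style statistic needs a guard when the tensor has no elements.

```python
from typing import Callable, Sequence, Tuple


def fake_quantize(x: float, smallest: float = 2.0 ** -9) -> float:
    """Hypothetical stand-in for a low-precision cast: flush tiny magnitudes to zero."""
    return 0.0 if abs(x) < smallest else x


def underflow_stats(
    values: Sequence[float],
    quantize: Callable[[float], float] = fake_quantize,
) -> Tuple[int, float]:
    """Return (underflow count, underflow percentage) for a flat sequence of values."""
    # Count values that were nonzero before the cast and zero after it.
    underflows_num = sum(1 for v in values if v != 0.0 and quantize(v) == 0.0)
    # Guard: an empty input would otherwise divide by zero below.
    if len(values) == 0:
        return 0, 0.0
    return underflows_num, 100.0 * underflows_num / len(values)
```

Without the length check, calling underflow_stats([]) would raise ZeroDivisionError, which is the general shape of failure an empty tensor can trigger in a ratio statistic.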

lengerfulluse · 2025-06-09T17:55:17Z

@pggPL I still get the following error after applying your fix above. I use torch 2.6 + CUDA 12.8 + TE 2.3 (with your 4 underflow PRs cherry-picked) + MLM 0.12 and NeMo 2.3. Could you share the environment and package versions you have verified to work for underflow logging, so that I can use your setup to verify?

    File "/TransformerEngine-23/transformer_engine/pytorch/module/layernorm_linear.py", line 734, in backward
      ln_out_total = ctx.input_quantizer(ln_out_total)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 202, in __call__
      return self.quantize(tensor)
    File "/TransformerEngine-23/transformer_engine/debug/pytorch/debug_quantization.py", line 283, in quantize
      quantized_tensor = self.parent_quantizer(tensor)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 202, in __call__
      return self.quantize(tensor)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 191, in quantize
      return _QuantizeFunc.forward(None, tensor, self)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 252, in forward
      return tex.quantize(tensor, quantizer)
  TypeError: quantize(): incompatible function arguments. The following argument types are supported:
      1. (tensor: torch.Tensor, quantizer: object, output: object = None, noop: Optional[torch.Tensor] = None) -> object

  Invoked with: <transformer_engine.debug.pytorch.debug_quantization.DebugQuantizedTensor object at 0x7febe0320fa0>, Float8CurrentScalingQuantizer(rowwise_usage=True, columnwise_usage=True, internal=False, )
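The traceback above shows a DebugQuantizedTensor being passed to a quantize function that accepts only torch.Tensor. The following is a minimal sketch of that failure pattern in plain Python, not Transformer Engine code. All class names (RawTensor, WrappedTensor, ParentQuantizer, DebugQuantizer) and the passthrough_wrapped flag are hypothetical, and the guard shown is just one possible way to avoid re-quantizing an already wrapped value, not the fix adopted in the library.

```python
class RawTensor:
    """Hypothetical stand-in for a plain high-precision tensor."""

    def __init__(self, data):
        self.data = list(data)


class WrappedTensor:
    """Hypothetical debug wrapper; deliberately not a RawTensor subclass."""

    def __init__(self, quantized):
        self.quantized = quantized


class ParentQuantizer:
    """Accepts only RawTensor, like a binding with a strict argument type."""

    def __call__(self, tensor):
        if not isinstance(tensor, RawTensor):
            raise TypeError(
                f"quantize(): incompatible argument type {type(tensor).__name__}"
            )
        # Crude "quantization": round every value to the nearest integer.
        return [round(v) for v in tensor.data]


class DebugQuantizer:
    """Wraps the parent quantizer's output in a WrappedTensor."""

    def __init__(self, parent, passthrough_wrapped=False):
        self.parent = parent
        self.passthrough_wrapped = passthrough_wrapped

    def __call__(self, tensor):
        # Optional guard: return already wrapped inputs unchanged.
        if self.passthrough_wrapped and isinstance(tensor, WrappedTensor):
            return tensor
        return WrappedTensor(self.parent(tensor))
```

Calling an unguarded DebugQuantizer a second time on its own output raises TypeError in ParentQuantizer, mirroring the call chain in the traceback. With passthrough_wrapped=True the wrapped value is returned as-is.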

pggPL · 2025-06-10T21:34:19Z

My fix above was not intended to fix part (2). I don't think this is a configuration problem; it seems to be a bug. I will try to replicate the issue tomorrow, and if I am not able to, I will try to share a configuration.

I'm not sure I made my comment on point (2) clear: it seems to be an issue on our side, and it seems that we need more tests. I plan to add them by the end of this week, and I hope that this problem will be fixed.

pggPL and others added 3 commits March 25, 2025 14:06

tests drop

52b99c7

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

c14dd42

for more information, see https://pre-commit.ci

Merge branch 'main' into nvinspect_tests

7b1f5bd

Signed-off-by: Przemek Tredak <[email protected]>

pggPL and others added 4 commits May 13, 2025 14:47

fix

07534df

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e984a05

for more information, see https://pre-commit.ci

move dir

90a9818

Signed-off-by: Pawel Gadzinski <[email protected]>

Merge branch 'main' into nvinspect_tests

be5faa9

pggPL and others added 4 commits May 14, 2025 15:17

tests fox

5164e3c

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

37c486c

for more information, see https://pre-commit.ci

fix

5274354

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

62ea780

for more information, see https://pre-commit.ci

pggPL and others added 5 commits May 15, 2025 15:53

fix

9a455d6

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

44b773e

for more information, see https://pre-commit.ci

Merge branch 'main' into nvinspect_tests

d691c07

fix

d530ebf

Signed-off-by: Pawel Gadzinski <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

bd0c37f

for more information, see https://pre-commit.ci

pggPL mentioned this pull request May 19, 2025

[Pytorch] NVIDIA-DL-Framework-Inspect support – part 2 – features #1613

Merged

7 tasks

ptrendx approved these changes May 19, 2025


ptrendx merged commit 2645eae into NVIDIA:main May 19, 2025
11 checks passed
