[Pytorch] NVIDIA-DL-Framework-Inspect support – part 3 – tests #1612

Merged
merged 16 commits into from
May 19, 2025

Conversation

pggPL
Collaborator

@pggPL pggPL commented Mar 25, 2025

Description

Needs to be merged after the parts 1-3.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL and others added 3 commits March 25, 2025 14:06
@ptrendx
Member

ptrendx commented May 9, 2025

/te-ci pytorch L1

@ksivaman
Member

/te-ci pytorch L0 L1

pggPL and others added 4 commits May 14, 2025 15:17
@pggPL
Collaborator Author

pggPL commented May 14, 2025

@ksivaman these tests are not included in te-ci, so they need to be run manually. I started the pipeline now.

@ptrendx ptrendx merged commit 2645eae into NVIDIA:main May 19, 2025
11 checks passed
ptrendx added a commit that referenced this pull request May 19, 2025
* tests drop

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* move dir

Signed-off-by: Pawel Gadzinski <[email protected]>

* tests fox

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Przemek Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemek Tredak <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
@lengerfulluse

lengerfulluse commented May 27, 2025

@pggPL good afternoon. I applied some of the bug fixes from part 3, but I still encountered some issues, possibly caused by additional bugs:

  1. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/features/utils/stats_computation.py#L128: should underflows_num also be changed to underflows?
  2. https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/pytorch/debug_quantization.py#L322: on this line I got the error AttributeError: 'DebugQuantizedTensor' object has no attribute 'to', which seems to indicate that the logic above fails when the quantized tensor is a DebugQuantizedTensor.

Looking forward to your response.

sudhakarsingh27 pushed a commit to sudhakarsingh27/TransformerEngine that referenced this pull request Jun 2, 2025
…A#1612)

@pggPL
Collaborator Author

pggPL commented Jun 9, 2025

@lengerfulluse thank you for the feedback.

  1. Hmm, it looks like the name is correct. I do see an error that will break MXFP8 stats collection; it will be one of the errors fixed in [PyTorch Debug] Fixed the empty tensor bug in statistics computation #1843.
  2. This definitely should not happen, and our tests should catch it. I will run all the standard sanity tests we use for standard layers with debug layers as well. I hope it will be fixed quickly.
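
For readers following item 1, here is a minimal sketch of an underflow statistic that tolerates empty inputs. It is an illustration only: the names fake_quantize and underflow_percent, the flush-to-zero threshold, and the choice of denominator are all assumptions, and none of it is taken from stats_computation.py or from #1843.

```python
from typing import Callable, Sequence


def fake_quantize(value: float, smallest: float = 2.0 ** -9) -> float:
    """Hypothetical stand-in for a low-precision cast.

    Magnitudes below `smallest` are flushed to zero; everything else
    passes through unchanged. A real FP8 cast behaves differently.
    """
    return 0.0 if abs(value) < smallest else value


def underflow_percent(
    values: Sequence[float],
    quantize: Callable[[float], float] = fake_quantize,
) -> float:
    """Percentage of all elements that are nonzero before quantization
    and zero after it (assumed definition, for illustration)."""
    total = len(values)
    if total == 0:
        # Guard for empty input: without it the division below would fail.
        return 0.0
    underflows = sum(1 for v in values if v != 0.0 and quantize(v) == 0.0)
    return 100.0 * underflows / total
```

With this sketch, an empty input yields 0.0 instead of raising, and exact zeros in the input are not counted as underflows.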

@lengerfulluse

@pggPL I still get the following error after applying your fix above. I use torch 2.6 + CUDA 12.8 + TE 2.3 (with your 4 underflow PRs cherry-picked) + MLM 0.12 and NeMo 2.3. Could you tell me which environment and package versions you have verified as working for underflow logging, so that I can use your setup to verify?

    File "/TransformerEngine-23/transformer_engine/pytorch/module/layernorm_linear.py", line 734, in backward
      ln_out_total = ctx.input_quantizer(ln_out_total)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 202, in __call__
      return self.quantize(tensor)
    File "/TransformerEngine-23/transformer_engine/debug/pytorch/debug_quantization.py", line 283, in quantize
      quantized_tensor = self.parent_quantizer(tensor)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 202, in __call__
      return self.quantize(tensor)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 191, in quantize
      return _QuantizeFunc.forward(None, tensor, self)
    File "/TransformerEngine-23/transformer_engine/pytorch/tensor/quantized_tensor.py", line 252, in forward
      return tex.quantize(tensor, quantizer)
  TypeError: quantize(): incompatible function arguments. The following argument types are supported:
      1. (tensor: torch.Tensor, quantizer: object, output: object = None, noop: Optional[torch.Tensor] = None) -> object

  Invoked with: <transformer_engine.debug.pytorch.debug_quantization.DebugQuantizedTensor object at 0x7febe0320fa0>, Float8CurrentScalingQuantizer(rowwise_usage=True, columnwise_usage=True, internal=False, )
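
The traceback above shows an already-wrapped DebugQuantizedTensor reaching a function that accepts only torch.Tensor. The sketch below reproduces that class of failure with plain Python stand-ins and shows one possible guard. All names in it (PlainTensor, DebugWrapper, strict_quantize, quantize_once) are hypothetical, and this is not the fix applied in TransformerEngine.

```python
class PlainTensor:
    """Hypothetical stand-in for an unquantized tensor."""

    def __init__(self, data):
        self.data = list(data)


class DebugWrapper:
    """Hypothetical stand-in for a debug wrapper around a quantized result."""

    def __init__(self, quantized):
        self.quantized = quantized


def strict_quantize(tensor):
    """Mimics a backend call that only accepts plain tensors."""
    if not isinstance(tensor, PlainTensor):
        raise TypeError(
            f"quantize(): incompatible argument type {type(tensor).__name__}"
        )
    # Rounding stands in for the real quantization step.
    return DebugWrapper(PlainTensor(round(v) for v in tensor.data))


def quantize_once(tensor):
    """One possible guard: pass already-wrapped inputs through unchanged."""
    if isinstance(tensor, DebugWrapper):
        return tensor
    return strict_quantize(tensor)


wrapped = quantize_once(PlainTensor([0.4, 1.6]))
try:
    strict_quantize(wrapped)  # second quantization without the guard
except TypeError as err:
    print(f"unguarded call failed: {err}")
print(quantize_once(wrapped) is wrapped)  # guarded call returns the same object
```

Whether returning the wrapper unchanged is the right behaviour depends on what the caller expects; unwrapping to a plain tensor before re-quantizing is another option.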

@pggPL
Collaborator Author

pggPL commented Jun 10, 2025

My fix above was not intended to fix part (2). I don't think this is a configuration problem; it seems to be a bug. I will try to replicate the issue tomorrow, and if I am not able to, I will try to share a configuration.

I'm not sure I made my comment on point (2) clear: it seems to be an issue on our side, and it seems that we need more tests. I plan to add them by the end of this week, and I hope that this problem will be fixed.


4 participants