[Pytorch] NVIDIA-DL-Framework-Inspect support – part 2 – features #1613


Merged
merged 20 commits into from
May 8, 2025

Conversation

pggPL
Collaborator

@pggPL pggPL commented Mar 25, 2025

Description

The docs for the NVIDIA-DL-Framework-Inspect support. Needs to be merged after part 1.

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ksivaman ksivaman self-requested a review April 21, 2025 17:05
#
# See LICENSE for license information.

"""FakeQuant Feature support for nvidia-dlframework-inspect"""
Member

Notes from review: make this more recipe-centric, as opposed to format-centric.

Collaborator Author

I think that quant_format will be sufficient for all the recipes we plan to add, so I haven't changed that. If something more complex appears, I can add a new feature.

"""API call responsible for choice between high-precision and FP8 GEMM execution."""

for key in config:
    if key != "gemm":
Member

I don't think I understand this:

  • in the docstring you say that the option to this feature is "gemms", but here you look for "gemm". Did you intend this to be just gemm (the variable)?
  • I don't see how the option provided by the user has any impact on this feature's behavior.

Collaborator Author

It is parsed by api.py from gemms/gemms_struct into gemm.

Nvidia-DL-Framework-Inspect handles the enabled keyword itself, not the features.
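For illustration, the expansion described above could look roughly like this sketch; the function name expand_gemms and the config shape are hypothetical, not TE's actual api.py code.

```python
# Hypothetical sketch of expanding a feature config that lists several
# GEMMs under "gemms" into per-GEMM configs carrying a single "gemm"
# key, mirroring what api.py is described as doing above.

def expand_gemms(feature_config):
    """Return one config dict per GEMM, each with a single "gemm" key."""
    gemms = feature_config.get("gemms", [])
    common = {k: v for k, v in feature_config.items() if k != "gemms"}
    return [{**common, "gemm": g} for g in gemms]

configs = expand_gemms({"gemms": ["fprop", "dgrad"], "quant_format": "FP8E4M3"})
# each entry now holds exactly one GEMM name under the "gemm" key
```

This is why the feature code above can look for a single "gemm" key even though the user writes "gemms" in the YAML.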

    lambda buffers: sum(_get(buffers, "underflows_num")),
),
"saturations_num": (
    lambda x: (x == 126).sum(),
Member

Where does 126 come from?

Collaborator Author

It's the highest finite encoded value for FP8 (E4M3): the encodings 127 and 255 represent NaN. I added a constant to make this clear.
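To sketch what the `(x == 126).sum()` check above does: in FP8 E4M3, the byte 126 (0x7E) is the largest finite positive encoding, so comparing raw encodings against it counts saturated elements. The helper below is illustrative only, operating on plain Python lists rather than tensors.

```python
# Counting FP8 E4M3 saturations in a sequence of raw byte encodings.
# 126 (0x7E) is the largest finite positive encoding; 127 (0x7F) and
# 255 (0xFF) encode NaN. Illustrative sketch, not TE's implementation.

E4M3_MAX_POS_ENCODING = 126

def count_saturations(raw_bytes):
    """Count elements that hit the maximum finite positive FP8 value."""
    return sum(1 for b in raw_bytes if b == E4M3_MAX_POS_ENCODING)

print(count_saturations([0, 3, 126, 70, 126, 127]))  # two saturated elements
```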

Collaborator Author

I temporarily removed this feature: nobody used it, and Anmol and Shreyas said they added it only to have some fp8_stats features. I will re-add it in another PR.

pggPL and others added 14 commits April 25, 2025 15:32
Co-authored-by: Przemyslaw Tredak <[email protected]>
Signed-off-by: Paweł Gadziński <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
@lengerfulluse

@pggPL thanks for continuing to work on this great feature!

Just curious, do you have a rough ETA for getting it merged? I'm deciding whether to wait for the merge or pull your PR locally to do some tests. One other question: I saw you removed the saturations part; may I know the reason?

ksivaman previously approved these changes May 7, 2025
@ptrendx
Member

ptrendx commented May 7, 2025

@lengerfulluse we intend to merge this PR ~tomorrow.

@pggPL
Collaborator Author

pggPL commented May 8, 2025

I removed the saturation part because it was incorrect: it's more complex than I thought, and I will re-add it in one of the next PRs.

Member

@ptrendx ptrendx left a comment

LGTM

@ptrendx
Member

ptrendx commented May 8, 2025

/te-ci pytorch

@ptrendx ptrendx merged commit b33dd08 into NVIDIA:main May 8, 2025
10 of 12 checks passed
@lengerfulluse

lengerfulluse commented May 16, 2025

I removed the saturation part because it was incorrect: it's more complex than I thought, and I will re-add it in one of the next PRs.

@pggPL I am conducting FP8 underflow-statistics experiments based on your recently merged NVIDIA-DL-Framework-Inspect support (part 1 + part 2). When I enable the debug_api, I can see it logs messages into the nvdlfw_inspect_logs dir. But when I try to enable LogFp8TensorStats with the following config:

fp8_tensor_stat_collection:
    enabled: True
    layers:
        layer_name_regex_pattern: .* # Regex pattern selecting all layers
        #layer_types: [layernorm_linear, layernorm_mlp]
    transformer_engine:
        LogFp8TensorStats:
            enabled: True
            tensors: [gradient, activation, weight]
            stats: [underflows%]
            freq: 1
            start_step: 1
            end_step: 10

I got an "Unexpected type for quantizer" exception at the following lines:
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/module/layernorm_linear.py#L203C1-L204C1

https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/csrc/common.cpp#L35

It looks like it cannot recognize DebugQuantizer; detailed error:

[rank5]:   File "/workspace/TransformerEngine-2.2-inspector/transformer_engine/pytorch/module/layernorm_linear.py", line 185, in forward
[rank5]:     ln_out, mu, rsigma = apply_normalization(
[rank5]:   File "/workspace/TransformerEngine-2.2-inspector/transformer_engine/pytorch/module/_common.py", line 85, in apply_normalization
[rank5]:     return normalization_func(
[rank5]: RuntimeError: /TransformerEngine/transformer_engine/pytorch/csrc/common.cpp:34 in function convert_quantizer: Unexpected type for quantizer

I use the 25.04 container (CUDA 12.9 + rebuilt PyTorch 2.6 + TE 2.2 with the inspector PRs #1613 and #1614 patched in). Is there any misconfiguration or something missing? I'd appreciate your insights.
Thanks again!
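(As an aside on the config above: the layer_name_regex_pattern field selects layers by regular expression. The sketch below illustrates that selection; the helper name select_layers and the layer names are made up, and this is not TE's selection code.)

```python
import re

# Sketch of the layer selection implied by layer_name_regex_pattern in
# the config above: stats are collected only for layers whose names
# match the pattern. Hypothetical helper, not TE's implementation.

def select_layers(layer_names, pattern):
    """Return the layer names fully matched by the regex pattern."""
    rx = re.compile(pattern)
    return [name for name in layer_names if rx.fullmatch(name)]

# ".*" from the config selects every layer; a narrower pattern filters:
print(select_layers(["decoder.0.mlp", "decoder.0.attn", "embedding"],
                    r"decoder\..*"))
```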

@pggPL
Collaborator Author

pggPL commented May 19, 2025

Hi @lengerfulluse, thank you for the feedback. We haven't merged the tests yet and it seems that something stopped working. I hope that the PR with the tests will be merged today and this will solve your problem.

@lengerfulluse

Hi @lengerfulluse, thank you for the feedback. We haven't merged the tests yet and it seems that something stopped working. I hope that the PR with the tests will be merged today and this will solve your problem.

There is another small import bug in
https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/debug/features/api.py#L15

from transformer_engine.pytorch.tensor import all_tensor_types

While the actual definition is

def get_all_tensor_types():

https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/tensor/__init__.py#L47

Maybe you are already aware of it; mentioning it just in case.
