feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters #1066
Conversation
Hi @hXl3s, please sign your commits. See https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst#sign-your-work for details.
Force-pushed from 994289e to a7db770.
Could you also add an error when somebody tries to run THD in inference with a KV-cache here: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L6732-L6735?
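For illustration, a minimal sketch of the kind of guard being requested. The helper name and the `qkv_format`/`inference_params` argument names are assumptions based on the surrounding discussion, not the exact code at the linked location:

```python
def _check_thd_kv_cache(qkv_format: str, inference_params) -> None:
    """Reject the unsupported combination of THD layout and KV-cache inference.

    Hypothetical helper for illustration only; the actual check lives inside
    transformer_engine/pytorch/attention.py and may be written differently.
    """
    if inference_params is not None and qkv_format == "thd":
        raise ValueError(
            "qkv_format='thd' is not supported together with the KV-cache "
            "(inference_params) during inference; use 'bshd' or 'sbhd' instead."
        )
```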
Also @hXl3s, could you add a unit test to make sure it works (e.g. comparing thd vs bshd outputs of TransformerLayer)?
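A sketch of what such a comparison test could look like. The constructor arguments (`attn_input_format`, `params_dtype`), shapes, and tolerances are assumptions for illustration, not the test that was eventually merged:

```python
import torch
from transformer_engine.pytorch import TransformerLayer

def test_thd_vs_bshd_outputs():
    # Illustrative sizes only; the merged test may use different shapes/tolerances.
    batch, seqlen, hidden, heads = 2, 128, 256, 8

    layer_bshd = TransformerLayer(hidden, 4 * hidden, heads, attn_input_format="bshd",
                                  params_dtype=torch.bfloat16).cuda().eval()
    layer_thd = TransformerLayer(hidden, 4 * hidden, heads, attn_input_format="thd",
                                 params_dtype=torch.bfloat16).cuda().eval()
    layer_thd.load_state_dict(layer_bshd.state_dict())

    x_bshd = torch.randn(batch, seqlen, hidden, device="cuda", dtype=torch.bfloat16)
    out_bshd = layer_bshd(x_bshd)

    # The same tokens flattened into a packed THD tensor, driven by the new
    # cumulative-sequence-length arguments added in this PR.
    x_thd = x_bshd.reshape(batch * seqlen, hidden)
    cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen,
                              device="cuda", dtype=torch.int32)
    out_thd = layer_thd(x_thd,
                        cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
                        max_seqlen_q=seqlen, max_seqlen_kv=seqlen)

    torch.testing.assert_close(out_thd.reshape(batch, seqlen, hidden), out_bshd,
                               rtol=2e-2, atol=2e-2)
```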
Force-pushed from 4f902ec to d82386d.
Force-pushed from 4b2cad7 to 09d9f39.
@ptrendx Added a test case comparing the output of THD vs BSHD for float16 and bfloat16. Additionally, as requested, there is now an assert that checks that the THD layout is not used with the KV-cache during inference.
Resolved all comments.
LGTM, thanks!
/te-ci pytorch
@hXl3s Could you skip the thd test when the GPU SM arch is lower than 8.0 (as neither FlashAttention nor cuDNN supports those architectures)?
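One common way to express such a skip in pytest is shown below; the helper name and the test body are placeholders, and the actual guard in the merged test may be written differently:

```python
import pytest
import torch

def _sm_arch_at_least(major: int, minor: int = 0) -> bool:
    """Return True if the current GPU's compute capability is >= (major, minor)."""
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (major, minor)

@pytest.mark.skipif(
    not _sm_arch_at_least(8),
    reason="THD layout needs FlashAttention or cuDNN fused attention (sm_80 or newer).",
)
def test_transformer_layer_thd_vs_bshd():
    ...  # body omitted; see the THD-vs-BSHD comparison sketched earlier
```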
/te-ci pytorch
feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters (#1066)

* Added ability for seqlen for transformer and mha layer
* Documentation for new parameters
* Add tests for THD layout, assert for THD layout with KV-Cache
* Fixed tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Move THD logic in shape calculation, add missing optional in params
* Skip the THD test on GPUs older than Ampere

Signed-off-by: Lukasz Pierscieniewski <[email protected]>
Signed-off-by: Przemek Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Przemek Tredak <[email protected]>
Description
TransformerLayer and MultiheadAttention do not allow passing arbitrary-length sequences. While this feature is supported by DotProductAttention, it cannot be controlled from the higher abstraction layers. This PR fixes that issue.
It also fixes a runtime bug that occurs when the thd layout is used.
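For context, a hedged sketch of how the new forward arguments could be used with a packed variable-length (thd) batch; the `attn_input_format`/`params_dtype` constructor arguments and the exact shapes are assumptions, not taken from this PR:

```python
import torch
from transformer_engine.pytorch import TransformerLayer

# Two sequences of lengths 5 and 3 packed into one 8-token "thd" batch.
hidden_size, num_heads = 256, 8
layer = TransformerLayer(
    hidden_size, 4 * hidden_size, num_heads,
    attn_input_format="thd",          # assumed: packed layout selected at construction time
    params_dtype=torch.bfloat16,
).cuda()

tokens = torch.randn(5 + 3, hidden_size, device="cuda", dtype=torch.bfloat16)
cu_seqlens = torch.tensor([0, 5, 8], device="cuda", dtype=torch.int32)

# The sequence-length arguments introduced by this PR.
out = layer(
    tokens,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=5,
    max_seqlen_kv=5,
)
```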
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
- Add cu_seqlens_q, cu_seqlens_kv, max_seqlen_q, and max_seqlen_kv to MultiheadAttention and TransformerLayer
Checklist: