feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters #1066
Conversation
Hi @hXl3s, please sign your commits. See https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst#sign-your-work for details.
Force-pushed from 994289e to a7db770.
Could you also add an error when somebody tries to run THD in inference with a KV-cache here: https://github.com/NVIDIA/TransformerEngine/blob/main/transformer_engine/pytorch/attention.py#L6732-L6735?
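For illustration, a minimal sketch of the kind of guard being requested. The helper name and the `qkv_format`/`inference_params` argument names are assumptions based on the surrounding discussion, not the exact code at the linked location:

```python
def _check_thd_kv_cache(qkv_format: str, inference_params) -> None:
    """Reject the unsupported combination of THD layout and KV-cache inference.

    Hypothetical helper for illustration only; the actual check lives inside
    transformer_engine/pytorch/attention.py and may be written differently.
    """
    if inference_params is not None and qkv_format == "thd":
        raise ValueError(
            "qkv_format='thd' is not supported together with the KV-cache "
            "(inference_params) during inference; use 'bshd' or 'sbhd' instead."
        )
```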
Also @hXl3s, could you add a unit test to make sure it works (e.g. comparing thd vs bshd outputs of TransformerLayer)?
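A sketch of what such a comparison test could look like. The constructor arguments (`attn_input_format`, `params_dtype`), shapes, and tolerances are assumptions for illustration, not the test that was eventually merged:

```python
import torch
from transformer_engine.pytorch import TransformerLayer

def test_thd_vs_bshd_outputs():
    # Illustrative sizes only; the merged test may use different shapes/tolerances.
    batch, seqlen, hidden, heads = 2, 128, 256, 8

    layer_bshd = TransformerLayer(hidden, 4 * hidden, heads, attn_input_format="bshd",
                                  params_dtype=torch.bfloat16).cuda().eval()
    layer_thd = TransformerLayer(hidden, 4 * hidden, heads, attn_input_format="thd",
                                 params_dtype=torch.bfloat16).cuda().eval()
    layer_thd.load_state_dict(layer_bshd.state_dict())

    x_bshd = torch.randn(batch, seqlen, hidden, device="cuda", dtype=torch.bfloat16)
    out_bshd = layer_bshd(x_bshd)

    # The same tokens flattened into a packed THD tensor, driven by the new
    # cumulative-sequence-length arguments added in this PR.
    x_thd = x_bshd.reshape(batch * seqlen, hidden)
    cu_seqlens = torch.arange(0, (batch + 1) * seqlen, seqlen,
                              device="cuda", dtype=torch.int32)
    out_thd = layer_thd(x_thd,
                        cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
                        max_seqlen_q=seqlen, max_seqlen_kv=seqlen)

    torch.testing.assert_close(out_thd.reshape(batch, seqlen, hidden), out_bshd,
                               rtol=2e-2, atol=2e-2)
```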
Force-pushed from 4f902ec to d82386d.
Force-pushed from 4b2cad7 to 09d9f39.
@ptrendx Added a test case comparing the output of THD vs BSHD for float16 and bfloat16. Additionally, as requested, there is now an assert that checks that the THD layout is not used with the KV-cache during inference.
Resolved all comments.
LGTM, thanks!
/te-ci pytorch
@hXl3s Could you skip the thd test when the GPU SM arch is lower than 8.0 (as neither FlashAttention nor cuDNN supports those architectures)?
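One common way to express such a skip in pytest is shown below; the helper name and the test body are placeholders, and the actual guard in the merged test may be written differently:

```python
import pytest
import torch

def _sm_arch_at_least(major: int, minor: int = 0) -> bool:
    """Return True if the current GPU's compute capability is >= (major, minor)."""
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (major, minor)

@pytest.mark.skipif(
    not _sm_arch_at_least(8),
    reason="THD layout needs FlashAttention or cuDNN fused attention (sm_80 or newer).",
)
def test_transformer_layer_thd_vs_bshd():
    ...  # body omitted; see the THD-vs-BSHD comparison sketched earlier
```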
/te-ci pytorch
feat(pytorch): Allow TransformerLayer and MultiheadAttention to accept sequence length parameters (#1066)

* Added ability for seqlen for transformer and mha layer
* Documentation for new parameters
* Add tests for THD layout, assert for THD layout with KV-Cache
* Fixed tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Move THD logic in shape calculation, add missing optional in params
* Skip the THD test on GPUs older than Ampere

Signed-off-by: Lukasz Pierscieniewski <[email protected]>
Signed-off-by: Przemek Tredak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Przemek Tredak <[email protected]>
Description
TransformerLayer and MultiheadAttention do not allow passing arbitrary-length sequences. While this feature is supported by DotProductAttention, it cannot be controlled from the higher abstraction layers. This PR fixes that issue.
It also fixes a runtime bug that occurs when the thd layout is used.
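For context, a hedged sketch of how the new forward arguments could be used with a packed variable-length (thd) batch; the `attn_input_format`/`params_dtype` constructor arguments and the exact shapes are assumptions, not taken from this PR:

```python
import torch
from transformer_engine.pytorch import TransformerLayer

# Two sequences of lengths 5 and 3 packed into one 8-token "thd" batch.
hidden_size, num_heads = 256, 8
layer = TransformerLayer(
    hidden_size, 4 * hidden_size, num_heads,
    attn_input_format="thd",          # assumed: packed layout selected at construction time
    params_dtype=torch.bfloat16,
).cuda()

tokens = torch.randn(5 + 3, hidden_size, device="cuda", dtype=torch.bfloat16)
cu_seqlens = torch.tensor([0, 5, 8], device="cuda", dtype=torch.int32)

# The sequence-length arguments introduced by this PR.
out = layer(
    tokens,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=5,
    max_seqlen_kv=5,
)
```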
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
- Add cu_seqlens_q, cu_seqlens_kv, max_seqlen_q, and max_seqlen_kv to MultiheadAttention and TransformerLayer
Checklist: