[PyTorch] Fix get_swa_mask() for padding masks #1281

cyanguwa · 2024-10-21T20:33:15Z

Description

This PR fixes sliding-window mask generation in UnfusedDotProductAttention. It corrects the logic for padding and arbitrary masks in get_swa_mask(), expands the docstring, refactors the call site, and extends the unit tests.

Fixes #1271
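
As a rough illustration of why the way two masks are combined matters, here is a minimal, library-free sketch. It assumes the convention that True marks a position to be masked out, and it uses plain Python lists in place of tensors. `combine_masks` is a hypothetical helper written for this note, not a function from attention.py.

```python
def combine_masks(mask_a, mask_b):
    """Element-wise OR of two 2-D boolean masks (assumed convention: True = masked out)."""
    return [
        [a or b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(mask_a, mask_b)
    ]


# Hypothetical padding mask: the last key position is padding for every query.
padding = [
    [False, False, True],
    [False, False, True],
]
# Hypothetical sliding-window mask: each query may attend to a different band of keys.
window = [
    [False, True, True],
    [True, False, False],
]

combined = combine_masks(padding, window)

# An element-wise AND, shown only for contrast.
with_and = [
    [a and b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(padding, window)
]
```

Under this convention, OR keeps a position masked whenever either mask excludes it. With AND, position [1][2] would become visible even though the padding mask excludes it.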

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Improve the logic in get_swa_mask() and its call site

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa · 2024-10-21T21:57:06Z

/te-ci pytorch

Marks101 · 2024-10-22T07:07:28Z

Hi @cyanguwa,
great, I like the idea of having all the masking logic in one place 👍
I just tested this and found a problem with cross attention:

        if "padding" in attn_mask_type:
            if max_seqlen_q == max_seqlen_kv:
                attention_mask = torch.logical_or(
>                   attention_mask.squeeze(1).unsqueeze(3), attention_mask
                )
E               AttributeError: 'tuple' object has no attribute 'squeeze'

The code in UnfusedDotProductAttention made these lines conditional on attention_type.

cyanguwa · 2024-10-29T21:43:56Z

Yes, I think I should use if attention_type == "self" here, because there could be cross-attention cases where max_seqlen_q == max_seqlen_kv but actual_seqlen_q != actual_seqlen_kv. I'll go through attention.py and see if there are other places where I should use attention_type instead.

Let me know if you observe any other issues too! :) Thanks!
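
To make the distinction concrete, here is a small sketch of branching on attention_type rather than on the maximum sequence lengths. It is library-free and works on a single batch entry. It assumes that self-attention passes one padding mask while cross-attention passes a (query mask, key/value mask) pair, which is what the 'tuple' error above suggests, and that True means masked out. `expand_padding_mask` is a hypothetical name, not the code in attention.py.

```python
def expand_padding_mask(attention_mask, attention_type):
    """Expand 1-D padding masks into a [seqlen_q][seqlen_kv] mask for one batch entry.

    attention_mask: a list of bools for "self", or a (q_mask, kv_mask) tuple for "cross".
    Assumed convention: True = masked out.
    """
    if attention_type == "self":
        # One mask describes both the query and the key/value padding.
        q_mask = kv_mask = attention_mask
    else:
        # Cross-attention: query and key/value padding are independent.
        q_mask, kv_mask = attention_mask
    # A position is masked if its query row or its key column is padding.
    return [[q or k for k in kv_mask] for q in q_mask]


# Self-attention: actual length 2, padded to 3.
self_mask = expand_padding_mask([False, False, True], "self")

# Cross-attention with equal max lengths (3 and 3) but different actual lengths (2 and 1).
cross_mask = expand_padding_mask(
    ([False, False, True], [False, True, True]), "cross"
)
```

The cross-attention call has max_seqlen_q == max_seqlen_kv, yet it still needs the tuple branch. A check on the maximum lengths alone would send it down the self-attention path.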

cyanguwa · 2024-11-12T22:58:53Z

transformer_engine/pytorch/attention.py

+    is applied, the bottom right corner comes from the [actual_seqlen_q[i], actual_seqlen_kv[i]] matrix,
+    for each batch i, not the [max_seqlen_q, max_seqlen_kv] matrix.::
+
+       attn_mask_type              output shape                                 diagonal alignment
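
The bottom-right alignment described in this docstring excerpt can be sketched for a single batch entry as follows. This is a library-free illustration under stated assumptions, not the get_swa_mask() implementation: True means masked out, window_size is a (left, right) pair, and a negative value means no limit on that side. `swa_mask_bottom_right` is a hypothetical name.

```python
def swa_mask_bottom_right(
    actual_seqlen_q, actual_seqlen_kv, max_seqlen_q, max_seqlen_kv, window_size
):
    """Sliding-window mask for one batch entry, aligned to the bottom-right corner
    of the [actual_seqlen_q, actual_seqlen_kv] sub-matrix (True = masked out)."""
    left, right = window_size
    # Shift the diagonal so the last real query lines up with the last real key.
    offset = actual_seqlen_kv - actual_seqlen_q
    mask = []
    for q in range(max_seqlen_q):
        row = []
        for k in range(max_seqlen_kv):
            is_padding = q >= actual_seqlen_q or k >= actual_seqlen_kv
            diag = q + offset
            too_far_left = left >= 0 and k < diag - left
            too_far_right = right >= 0 and k > diag + right
            row.append(is_padding or too_far_left or too_far_right)
        mask.append(row)
    return mask


# 2 real queries and 4 real keys, padded to 3 and 5; window of one key to the left.
mask = swa_mask_bottom_right(2, 4, 3, 5, window_size=(1, 0))
```

Here the last real query (row 1) ends its window at the last real key (column 3), and the padded row and column are fully masked.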


cyanguwa and others added 5 commits October 18, 2024 17:56

WIP: fix get_swa_mask for padding

e36273a

Signed-off-by: Charlene Yang <[email protected]>

fix mask type setting

4b19996

Signed-off-by: Charlene Yang <[email protected]>

fix the order of checking valid swa and changing mask type

7f08d47

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'NVIDIA:main' into fix_swa_mask

3dfe1fe

[pre-commit.ci] auto fixes from pre-commit.com hooks

3a22f93

for more information, see https://pre-commit.ci

cyanguwa mentioned this pull request Oct 21, 2024

[PyTorch] Use or instead of and to combine swa mask with existing mask #1271

Closed

13 tasks

cyanguwa added 2 commits October 21, 2024 14:54

fix lint

5f5c5c3

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into fix_swa_mask

afe721b
