
[PyTorch] Add support for cuDNN FusedAttention + THD + CP #885

Merged
merged 30 commits into NVIDIA:main from xren/cp_thd on Jun 10, 2024

Conversation

@xrennvidia (Collaborator) commented Jun 3, 2024

Description

Add support for cuDNN FusedAttention with the THD format and context parallelism (CP).

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Changes

Please list the changes introduced in this PR:

  • add CP support for THD format in FusedAttention, and add the related unit tests

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Xiaowei Ren <[email protected]>
@cyanguwa self-requested a review June 4, 2024 00:37
@cyanguwa (Collaborator) commented Jun 5, 2024

/te-ci pytorch

@cyanguwa (Collaborator) commented Jun 5, 2024

Thanks for submitting the PR. Could you please use our template to fill in the PR description?

# Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

## Type of change

- [ ] Documentation change (change only to the documentation, either a fix or a new content)
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

## Changes

Please list the changes introduced in this PR:

- Change A
- Change B

# Checklist:

- [ ] I have read and followed the [contributing guidelines](https://github.com/NVIDIA/TransformerEngine/blob/main/CONTRIBUTING.rst)
- [ ] The functionality is complete
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes

@cyanguwa changed the title from "Add support for cuDNN FA+THD+CP" to "[PyTorch] Add support for cuDNN FusedAttention + THD + CP" on Jun 5, 2024
@cyanguwa (Collaborator) commented Jun 6, 2024

/te-ci pytorch

@cyanguwa (Collaborator) commented Jun 7, 2024

/te-ci pytorch

@cyanguwa (Collaborator) left a comment


LGTM

@cyanguwa merged commit f68df15 into NVIDIA:main on Jun 10, 2024
20 checks passed
@xrennvidia deleted the xren/cp_thd branch June 10, 2024 19:45
BoxiangW pushed a commit to BoxiangW/TransformerEngine that referenced this pull request Jun 11, 2024
* add seq_offsets_qkvo for cudnn thd
* add seq_offsets_qkvo to AttnFuncWithCP
* fix seq_offsets calculation of cudnn thd
* remove a thd assert
* fix bias for thd test
* add thd test for cudnn FA with CP
* skip GQA/MQA test for cuDNN THD
* make sure seq_offsets are computed with qkv_group of hd_hd_hd while CP>1
* fix seq_offsets inputs
* remove two comments
* fix attn mask type for cudnn thd with cp
* fix attn_mask_type check
* fix attn_mask_type for cudnn fa with thd
* fix a typo
* fix out dout in bwd
* assert cudnn+thd does not support attn bias
* check if attn_mask_type has padding
* minor change
* change cp test batch size to 2
* fix code format
* fix two assert info
* fix assert comment
* fix assert comments
* minor fix
* fix assert comments

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: Charlene Yang <[email protected]>
Signed-off-by: Boxiang Wang <[email protected]>
@wplf (Contributor) commented Aug 26, 2024

Hi, thank you for the great work.

Could you tell me which commit adds support for flash-attn with CP + THD, please? And if I want to use a different strategy for splitting the CP tensor, what should I do?

@xrennvidia (Collaborator Author) replied:

It was done a while back; here is the PR. Now, THD + CP should work with both Tri Dao's Flash Attention and cuDNN Fused Attention. You can set NVTE_FLASH_ATTN and NVTE_FUSED_ATTN to select which backend to use.

I do not totally understand your second question. With the THD format, the CP implementation requires you to split each individual sequence into CP*2 chunks and assign 2 chunks to each GPU for load balancing (as we do with the BSHD/SBHD formats); see the sketch below. You can refer here for an example.
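As a rough illustration (not the TE API; the function and variable names below are made up), the load-balanced THD split works like this: every packed sequence is cut into 2*CP chunks along the token dimension, and rank r keeps chunks r and 2*CP-1-r:

    import torch

    def load_balanced_thd_split(x, cu_seqlens, cp_size, rank):
        """Keep chunks `rank` and `2*cp_size - 1 - rank` of every packed sequence.
        x: [total_tokens, h, d] THD tensor; cu_seqlens: sequence boundaries.
        Assumes every sequence length is divisible by 2*cp_size."""
        pieces = []
        for i in range(len(cu_seqlens) - 1):
            start, end = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
            chunks = x[start:end].chunk(2 * cp_size, dim=0)  # 2*CP chunks of one sequence
            # one chunk from the front half plus its mirror from the back half,
            # so the causal-attention work per rank is balanced
            pieces += [chunks[rank], chunks[2 * cp_size - 1 - rank]]
        return torch.cat(pieces, dim=0)  # this rank's local THD shard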

@wplf (Contributor) commented Aug 27, 2024

Hi there, thank you for your kind reply. Your repo helps me a lot.
For the second question: my tensor shape is [F * H * W, H, D], and I want to split the tensor along the H and W dims, so my strategy differs from the LLM load-balancing strategy. Could you show me how to gather the tensor in another way before AttnFuncWithCP?

@xrennvidia (Collaborator Author) replied:

We do not gather the tensor explicitly; we do KV P2P in a ring topology so that communication can be overlapped with compute.
We just added a new implementation that does a true KV all-gather (refer here), but it does not support the THD format yet.
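For intuition only, the ring exchange looks something like the sketch below (assuming the CP group is the whole world; attend_to_shard and merge_partials are hypothetical stand-ins for the per-step attention call and the softmax-correction logic, and the real AttnFuncWithCP overlaps the P2P traffic with compute on a separate CUDA stream):

    import torch
    import torch.distributed as dist

    def ring_attention_sketch(q, kv):
        """Q stays put on each rank; the KV shard hops around the ring so that,
        after cp_size steps, every rank has attended to every KV shard."""
        cp_size, rank = dist.get_world_size(), dist.get_rank()
        send_to, recv_from = (rank + 1) % cp_size, (rank - 1) % cp_size
        partials = []
        for _ in range(cp_size):
            kv_next = torch.empty_like(kv)
            reqs = dist.batch_isend_irecv([
                dist.P2POp(dist.isend, kv, send_to),
                dist.P2POp(dist.irecv, kv_next, recv_from),
            ])
            partials.append(attend_to_shard(q, kv))  # hypothetical: attention against the current KV shard
            for req in reqs:
                req.wait()
            kv = kv_next
        return merge_partials(partials)  # hypothetical: log-sum-exp style combination of partial outputs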

@wplf (Contributor) commented Aug 27, 2024

Sorry for the naive question.
I saw that the implementation of the CP split is below. Could you please tell me how to change the load-balancing strategy to a naive contiguous strategy?
PS: I need to make sure that the tensor on each rank stays intact.

    elif qkv_format == "thd":
        # naive contiguous split: each rank takes an equal, contiguous slice of the
        # packed token dimension instead of the load-balanced partitioned indices
        piece_q = q_.shape[0] // world_size
        seq_idx_q = torch.arange(piece_q * rank, piece_q * (rank + 1)).cuda()

        piece_kv = k_.shape[0] // world_size
        seq_idx_kv = torch.arange(piece_kv * rank, piece_kv * (rank + 1)).cuda()

        # original load-balanced partitioning, disabled for this experiment:
        # seq_idx_q = tex.thd_get_partitioned_indices(
        #     cu_seqlens_q_padded, q_.shape[0], world_size, rank
        # )
        # seq_idx_kv = tex.thd_get_partitioned_indices(
        #     cu_seqlens_kv_padded, k_.shape[0], world_size, rank
        # )

@xrennvidia (Collaborator Author) commented Aug 27, 2024

The current CP implementation can only support the load-balancing strategy.

Do you have fixed F, H, W? If so, why don't you change your format to SBHD or BSHD?

Or why do you have to avoid the load-balancing strategy? Are you doing bi-directional attention? The load-balancing strategy also works with bi-directional attention.

@wplf (Contributor) commented Aug 27, 2024

We want to support CP for [F * H * W, H, D] tensors with the THD format.
The reason we want to avoid the load-balancing strategy is that we need the tensor on each cp_rank to stay intact and to get the corresponding H, W, or F offsets so that it stays compatible with 2D/3D RoPE, e.g. (see the sketch after this list):

  • CP=2: [:, 0:H//2, 0:W, :] and [:, H//2:H, 0:W, :]
  • CP=4: [:, 0:H//2, 0:W//2, :], [:, H//2:H, 0:W//2, :], [:, 0:H//2, W//2:W, :], [:, H//2:H, W//2:W, :]
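For concreteness, here is a purely hypothetical sketch of the CP=2 case above, assuming the THD tensor is a flattened [F, H, W, h, d] latent:

    import torch

    def naive_hw_split_cp2(x_thd, F, H, W, rank):
        """CP=2 spatial split: rank 0 keeps [:, 0:H//2, 0:W, :], rank 1 keeps [:, H//2:H, 0:W, :].
        Tokens on each rank stay spatially contiguous, unlike the load-balanced interleaving."""
        x = x_thd.view(F, H, W, *x_thd.shape[1:])  # un-flatten the token dim back to (F, H, W)
        h0, h1 = (0, H // 2) if rank == 0 else (H // 2, H)
        return x[:, h0:h1].reshape(-1, *x_thd.shape[1:])  # back to a local THD layout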

@xrennvidia (Collaborator Author) replied:

Load balancing can also allow you to do RoPE, right?
You only need to make sure RoPE is initialized with the full sequence, and then slice the RoPE embeddings in the same way as the input on each rank, right?
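For example (a sketch, not TE code; build_rope_table and apply_rope are hypothetical helpers, and max_seqlen, rotary_dim, and q_local are placeholder names), with the load-balanced split the per-rank RoPE table can simply be indexed with the same partitioned token indices used to shard Q:

    # sketch: build RoPE for the full (unsplit) sequence once, then take the same
    # token indices that were used to shard Q on this CP rank
    freqs_full = build_rope_table(max_seqlen, rotary_dim)  # hypothetical full-sequence table
    freqs_local = freqs_full[seq_idx_q]                    # same indices as the local Q shard
    q_local = apply_rope(q_local, freqs_local)             # hypothetical RoPE application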

@wplf (Contributor) commented Aug 27, 2024

In the case of load balancing, the tensor on each rank has two offsets, which is not supported by our current API.

@xrennvidia (Collaborator Author) replied:

OK, then sorry, our current implementation can only work with the load-balancing split. To meet your requirements, you probably need to hack AttnFuncWithCP.

@wplf (Contributor) commented Aug 27, 2024

Thank you very much!

@wplf (Contributor) commented Aug 27, 2024

Thank you very much, again!
Could you please tell me how difficult it is to hack AttnFuncWithCP?
I am not familiar with CUDA. If this is too hard for me, I will find another way to bypass this problem.

@xrennvidia (Collaborator Author) replied:

This is a very hard question to answer; different people have a different feeling for the difficulty. The implementation is here. You can read it and try to estimate the difficulty yourself.

It's mainly PyTorch code, but the THD format does involve some CUDA code.

@wplf (Contributor) commented Aug 28, 2024

Hi Ren, I'm back again.
After reading ring attention thoroughly, I've come to another conclusion.
If I use AttnFuncWithCP (not the all-gather one) and Q, K, V are in the right place, then the output is in the right place. The only thing to note is to restore the whole sequence from the cp_group in the same way it was split. Ring attention doesn't care about the order of iteration. Is that correct?

Gladly, I've tested the idea, and it does work.
It passes:
torchrun --nnodes 1 --nproc-per-node 8 tests/pytorch/fused_attn/run_fused_attn_with_cp.py model=cp_1_0 qkv_format=thd

Thank you very much.
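A minimal sketch of that restore step, assuming the naive contiguous split from the earlier snippet (each rank holds an equal, contiguous slice of the packed token dimension):

    import torch
    import torch.distributed as dist

    def gather_thd_output(out_local, cp_group):
        """Undo the naive contiguous CP split: all-gather the per-rank output shards
        and concatenate them in rank order to recover the full packed sequence."""
        cp_size = dist.get_world_size(group=cp_group)
        shards = [torch.empty_like(out_local) for _ in range(cp_size)]
        dist.all_gather(shards, out_local.contiguous(), group=cp_group)
        return torch.cat(shards, dim=0)  # rank order == token order for a contiguous split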

@xrennvidia (Collaborator Author) replied:


If you do bidirectional attention (no_mask or padding_no_mask) and there are no padding tokens between sequences, then the current code should work for you even though you do not split the sequences in the load-balancing way. Otherwise, you will probably encounter some issues.

But it sounds like you already verified your idea. Glad you found the solution.
