
Conversation


@petercad petercad commented Oct 4, 2025

This PR updates FlashAttention to the new copy/MMA atoms.

Changes:

  • Prefill and decode unified into a single implementation, allowing simultaneous subgroup-level parallelization over K and Q rather than one or the other.
  • GEMMs and softmax grouped together, and the full K loop consolidated into an FMHA mainloop class.
    • This will facilitate further manual pipelining/overlap of GEMM with softmax.
  • New copy/MMA atoms and reorders used to transparently support arbitrary data types.
  • Automatic copy/MMA operator selection.

Current status: the prefill/decode examples are almost all working, with performance similar to or better than the old examples.

Known issues:

  • Head size 192 decode config doesn't compile yet -- to be fixed.
  • Strange SYCL compiler behavior/bug with tSrS->tArP reorder. Apparently the compiler believes there is UB somewhere and will omit a large section of the kernel as a result. For the moment, there's a direct copy as a workaround while I pin down the issue. I'm not able to reproduce this behavior with the reorder in isolation.

Additional features (causal masking, variable sequence lengths, etc.) to be added later.

Reminder: the new atoms require a very recent driver due to necessary IGC fixes/enhancements. Recommended version: ci-comp_igc-30613.

@petercad changed the title from "[Umbrella commit] Re-implement FlashAttention with new Xe atoms" to "Re-implement FlashAttention with new Xe atoms" on Oct 4, 2025

petercad commented Oct 4, 2025

I will break up this large commit into self-contained smaller commits after review is complete.

Review comment:

Why is this here? This isn't FlashAttention-specific, is it?

petercad (Author):

No, it's not. These started as some simple helpers to make copying to/from SLM easier for the epilogue. We could move them, maybe to include/cute/algorithm/cute.hpp, though they should be made more sophisticated (use smaller/larger block sizes as appropriate, automatic fallback to scatter/gather, etc.).
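
For reference, a minimal sketch of the kind of helper being discussed (illustrative only, not the PR's code; it assumes CuTe tensors with compatible shapes):

#include <cute/tensor.hpp>

// Hypothetical helper: stage a tile into/out of SLM with a plain element-wise
// cute::copy. A production version would choose block-copy sizes and fall back
// to scatter/gather automatically, as noted above.
template <class SrcEngine, class SrcLayout, class DstEngine, class DstLayout>
CUTE_HOST_DEVICE void
copy_to_slm(cute::Tensor<SrcEngine, SrcLayout> const& src,
            cute::Tensor<DstEngine, DstLayout>      & dst)
{
  cute::copy(src, dst);   // element-wise copy; src and dst must have the same shape
}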

// No diagnostics/error will be issued by the compiler if it is not.
template <typename T>
CUTE_HOST_DEVICE void
set_wi_value(T &x, int i, T val)

Review comment:

Why don't you take i as a compile-time value to make this safer? The usage is on line 137, where the input comes from the unrolled loop index. If you replace the loop with for_each, you get a compile-time constant.
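
As a sketch of that suggestion (placeholder names; set_wi_value is the helper quoted above, and this assumes cute::for_each over cute::make_int_sequence):

#include <cute/tensor.hpp>   // cute::for_each, cute::make_int_sequence

// Illustrative only: iterate at compile time so every call site sees a
// constant index; a set_wi_value overload taking an integral constant could
// then check the index statically.
template <int N, class T>
CUTE_HOST_DEVICE void
fill_wi_values(T& x, T val)
{
  cute::for_each(cute::make_int_sequence<N>{}, [&](auto i) {
    set_wi_value(x, i, val);   // i is an integral constant, convertible to int
  });
}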

petercad (Author):

That is an option -- I did it this way since compile-time unrolling of the loop is IMO harder to use and harder to read.

I opened a compiler ticket for the lack of diagnostics, and they have a patch under review now to address it.

Review comment:

I see. As long as we have a diagnostic, that's fine. The current solution won't compile at -O0; not sure whether that matters.

for (int VV = 0; VV < VTiles; VV++) {
  copy(copy_v, tVgV(_,_,_,VV,K), tVrV);             // load the VV-th V tile into registers
  reorder(tVrV, tArV);                              // reorder/convert into the MMA's register layout
  cute::gemm(mma_pv, tArP, tArV, tArA(_,_,_,VV));   // accumulate P*V for this V tile

Review comment:

why the namespace?

petercad (Author):

Sometimes it's required to disambiguate the gemm name. I can't remember the exact ambiguity here, but I had to add it.
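
As a generic illustration of the kind of ambiguity (made-up namespaces and types, not the actual CuTe overload set): when the argument types come from namespaces that each declare a gemm, argument-dependent lookup finds both and an unqualified call is ambiguous.

// Toy ADL example; everything here is illustrative.
namespace na { struct A {}; template <class X, class Y> void gemm(X const&, Y const&) {} }
namespace nb { struct B {}; template <class X, class Y> void gemm(X const&, Y const&) {} }

void call(na::A const& a, nb::B const& b) {
  // gemm(a, b);    // error: ambiguous -- ADL finds both na::gemm and nb::gemm
  na::gemm(a, b);   // qualifying the call selects the intended overload
}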

for (; tile_scheduler.is_valid(); ++tile_scheduler) {
  auto [blk_q, blk_v, head, idx_b] = tile_scheduler.get_block_coord(); // (Q,V,h,b)
  auto blk_qv = make_coord(blk_q, blk_v);
  int head_q = head / head_group_q;
@wuxun-zhang wuxun-zhang commented Oct 10, 2025

In line 65 of xe_tile_scheduler.hpp, grid.z is set to batch * num_heads_q, so head here stands for the query head index; it seems we need to calculate head_kv instead of head_q:

int head_group_q = s.num_heads_q / s.num_heads_kv;
int head_kv = head / head_group_q;
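
For example (hypothetical sizes): with num_heads_q = 32 and num_heads_kv = 8, head_group_q = 32 / 8 = 4, so query head 13 maps to head_kv = 13 / 4 = 3.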

Edit:

In my local test, after applying all of the suggested changes, it works well and the correctness check passes.


auto &p = params.kernel;
ProblemShape const& s = p.shape;
int head_group_q = s.num_heads_kv / s.num_heads_q;


Suggested change:
- int head_group_q = s.num_heads_kv / s.num_heads_q;
+ int head_group_q = s.num_heads_q / s.num_heads_kv;

Comment on lines +189 to +191
auto [blk_q, blk_v, head, idx_b] = tile_scheduler.get_block_coord(); // (Q,V,h,b)
auto blk_qv = make_coord(blk_q, blk_v);
int head_q = head / head_group_q;


Suggested change:
- auto [blk_q, blk_v, head, idx_b] = tile_scheduler.get_block_coord(); // (Q,V,h,b)
- auto blk_qv = make_coord(blk_q, blk_v);
- int head_q = head / head_group_q;
+ auto [blk_q, blk_v, head_q, idx_b] = tile_scheduler.get_block_coord(); // (Q,V,h,b)
+ auto blk_qv = make_coord(blk_q, blk_v);
+ int head = head_q / head_group_q;


// Epilogue
CollectiveEpilogue epilogue{params.epilogue, shared_storage.epilogue};
epilogue(O(_,_,head,idx_b),


Suggested change:
- epilogue(O(_,_,head,idx_b),
+ epilogue(O(_,_,head_q,idx_b),
