Open-sourcing custom attention kernels in TensorRT-LLM #3987
juney-nvidia announced in Announcements
Replies: 2 comments 3 replies
-
Thanks for sharing the XQA kernel! This is more straightforward than CUTLASS for understanding some of the low-level details. So I have one, probably noob, question: why are there only wrappers for …
3 replies
-
Thanks for sharing these kernels! Are the kernels compatible with Blackwell? Any plans for Blackwell-specific optimizations?
0 replies
-
Dear Community,
We're excited to share significant updates regarding our custom attention kernels:
Decoding Phase Kernels (XQA) Now Available
The high-performance kernels for decoding-phase attention have been fully open-sourced in PR #3762. This fulfills numerous community requests to enable deeper customization of attention mechanisms.
Prefill Phase Kernels Coming Soon
Our team is actively working to open-source the optimized prefill-phase attention kernels, with completion targeted by mid-May 2025.
Update on May 14: The high-performance kernels for prefill-phase attention have been fully open-sourced in PR #4185!
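For readers new to the distinction between the two phases these kernels target, here is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code; all function names are ours): prefill runs full causal attention over the whole prompt, while decode attends a single new query token against the KV cache, which is the workload the XQA kernels accelerate.

```python
# Illustrative sketch of prefill-phase vs. decode-phase attention.
# Not TensorRT-LLM code; names and shapes are simplified for clarity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill_attention(q, k, v):
    """Prefill: all prompt tokens attend at once under a causal mask (O(n^2) work)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n) score matrix
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[mask] = -np.inf                            # causal masking
    return softmax(scores) @ v                        # (n, d) context vectors

def decode_attention(q_new, k_cache, v_cache):
    """Decode: one new query token attends to the KV cache (O(n) work per step)."""
    d = q_new.shape[-1]
    scores = k_cache @ q_new / np.sqrt(d)             # (n,) scores against cache
    return softmax(scores) @ v_cache                  # (d,) output for the new token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 64
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    ctx = prefill_attention(q, k, v)                  # prompt processing
    q_new = rng.standard_normal(d)
    out = decode_attention(q_new, k, v)               # one generation step
    print(ctx.shape, out.shape)                       # (8, 64) (64,)
```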
These releases empower developers to inspect and customize the attention kernels directly.
We encourage all community members to explore these components and share their enhancements. Your contributions will help advance LLM inference efficiency across NVIDIA GPUs.
Let's build the future of accelerated LLMs together!
Best regards,
The TensorRT-LLM Team