Open-sourcing custom attention kernels in TensorRT-LLM #3987
juney-nvidia announced in Announcements
Replies: 2 comments 3 replies
-
Thanks for sharing the XQA kernel! This is more straightforward than CUTLASS for understanding some of the low-level details. So I have one, probably noob, question: why are there only wrappers for …
3 replies
-
Thanks for sharing these kernels! Are the kernels compatible with Blackwell? Any plans for Blackwell-specific optimizations?
0 replies
-
Dear Community,
We're excited to share significant updates regarding our custom attention kernels:
Decoding Phase Kernels (XQA) Now Available
The high-performance kernels for decoding-phase attention have been fully open-sourced in PR #3762. This fulfills numerous community requests to enable deeper customization of attention mechanisms.
Prefill Phase Kernels Coming Soon
Our team is actively working to open-source the optimized prefill-phase attention kernels, with completion targeted by mid-May 2025.
Update on May 14: The high-performance kernels for prefill-phase attention have been fully open-sourced in PR #4185!
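For readers new to the distinction between the two phases these kernels target, here is a minimal NumPy sketch (illustrative only, not TensorRT-LLM code; all function names are ours): prefill runs full causal attention over the whole prompt, while decode attends a single new query token against the KV cache, which is the workload the XQA kernels accelerate.

```python
# Illustrative sketch of prefill-phase vs. decode-phase attention.
# Not TensorRT-LLM code; names and shapes are simplified for clarity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill_attention(q, k, v):
    """Prefill: all prompt tokens attend at once under a causal mask (O(n^2) work)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                     # (n, n) score matrix
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
    scores[mask] = -np.inf                            # causal masking
    return softmax(scores) @ v                        # (n, d) context vectors

def decode_attention(q_new, k_cache, v_cache):
    """Decode: one new query token attends to the KV cache (O(n) work per step)."""
    d = q_new.shape[-1]
    scores = k_cache @ q_new / np.sqrt(d)             # (n,) scores against cache
    return softmax(scores) @ v_cache                  # (d,) output for the new token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 64
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    ctx = prefill_attention(q, k, v)                  # prompt processing
    q_new = rng.standard_normal(d)
    out = decode_attention(q_new, k, v)               # one generation step
    print(ctx.shape, out.shape)                       # (8, 64) (64,)
```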
These releases empower developers to inspect and customize the attention kernels directly.
We encourage all community members to explore these components and share their enhancements. Your contributions will help advance LLM inference efficiency across NVIDIA GPUs.
Let's build the future of accelerated LLMs together!
Best regards,
The TensorRT-LLM Team