v0.2.1
What's Changed
- misc: addressing the package renaming issues by @yzh119 in #770
- feat: support deepseek prefill attention shape by @yzh119 in #765
- refactor: change the structure of attention updater by @yzh119 in #772
- hotfix: follow up of #772 by @yzh119 in #773
- bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
- bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
- ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
- perf: refactor fa2 prefill template by @yzh119 in #776
- feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
- bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
- misc: remove head dimension 64 from AOT by @yzh119 in #782
- misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
- bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
- refactor: make group_size a part of params by @yzh119 in #786
- bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787 (see the note after this list)
- fix rope logic in mla decoding by @zhyncs in #793
- Fix arguments of plan for split QK/VO head dims by @abmfy in #795
- test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
- bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
- Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
- feat: support f32 attention output in FA2 template by @yzh119 in #799
- feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801
- bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
- perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
- bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
- doc: add documentation to new MLA interface by @yzh119 in #811
- feat: unlocking MLA for A100 by @yzh119 in #812
- feat: cudagraph-compatible MLA API by @yzh119 in #813
- feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
- misc: fix sphinx by @abcdabcd987 in #815
- bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
- doc: improve mla related documentation by @yzh119 in #818
- release: bump version to v0.2.1 by @yzh119 in #819
- refactor: change to TORCH_LIBRARY by @youkaichao in #764
- Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
- bugfix: bugfix on sm89 MLA by @yzh119 in #821
- hotfix: bugfix on #812 by @yzh119 in #822
- refactor: change to TORCH_LIBRARY by @abmfy in #823 (see the sketch after this list)
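A short note on the sm_scale/log2e items (#787 and #801): fast attention kernels typically evaluate softmax logits with exp2 rather than exp, and since exp(x) = exp2(x * log2(e)), the log2(e) constant can be folded into sm_scale once instead of being applied to every logit. The C++ sketch below only illustrates that identity; the variable names and values are made up for illustration and this is not FlashInfer kernel code.

```cpp
// Minimal sketch: folding log2(e) into sm_scale so the kernel can use exp2.
#include <cmath>
#include <cstdio>

int main() {
  const float log2e = 1.4426950408889634f;  // log2(e)
  float qk = 0.73f;                          // a raw q.k dot product (made up)
  float sm_scale = 0.125f;                   // e.g. 1/sqrt(head_dim) for head_dim = 64
  float ref  = std::exp(qk * sm_scale);                // reference exp path
  float fast = std::exp2(qk * (sm_scale * log2e));     // exp2 path with the scale pre-multiplied
  std::printf("exp: %.7f  exp2: %.7f\n", ref, fast);   // the two agree up to rounding
  return 0;
}
```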
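On the TORCH_LIBRARY items (#764, #820, #823): TORCH_LIBRARY is PyTorch's C++ mechanism for declaring custom operators and registering their implementations per dispatch key, as an alternative to exposing them through a plain pybind11 module. The sketch below shows only the general registration pattern; the namespace and operator name are hypothetical and do not reflect FlashInfer's actual bindings.

```cpp
// Hypothetical sketch of TORCH_LIBRARY-based op registration (not FlashInfer's code).
#include <torch/library.h>
#include <ATen/ATen.h>

// A trivial example op: returns its input unchanged.
at::Tensor my_op(const at::Tensor& x) { return x; }

// Declare the op schema under a namespace...
TORCH_LIBRARY(flashinfer_sketch, m) {
  m.def("my_op(Tensor x) -> Tensor");
}

// ...and register an implementation for a dispatch key (CPU here).
TORCH_LIBRARY_IMPL(flashinfer_sketch, CPU, m) {
  m.impl("my_op", &my_op);
}
```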
New Contributors
Full Changelog: v0.2.0.post2...v0.2.1