v0.2.1

Released by @yzh119 on 13 Feb 08:17 · commit dbb1e4e

What's Changed

  • misc: addressing the package renaming issues by @yzh119 in #770
  • feat: support deepseek prefill attention shape by @yzh119 in #765
  • refactor: change the structure of attention updater by @yzh119 in #772
  • hotfix: follow up of #772 by @yzh119 in #773
  • bugfix: Ensure Loop Termination by Enforcing IEEE-754 Compliance in Sampling Kernels by @yzh119 in #774
  • bugfix: fix the JIT warmup arguments in unittests by @yzh119 in #775
  • ci: change whl folder to flashinfer-python by @abcdabcd987 in #779
  • perf: refactor fa2 prefill template by @yzh119 in #776
  • feat: Separate QK/VO head dim dispatch for sm90 AOT by @abcdabcd987 in #778
  • bugfix: fix batch prefill attention kernel unittests by @yzh119 in #781
  • misc: remove head dimension 64 from AOT by @yzh119 in #782
  • misc: allow head_dim=64 for sm90 AOT by @abcdabcd987 in #783
  • bugfix: drop CTA_TILE_Q=32 by @abcdabcd987 in #785
  • refactor: make group_size a part of params by @yzh119 in #786
  • bugfix: MLA decode should multiply sm_scale by math::log2e by @tsu-bin in #787 (see the note after this list)
  • fix rope logic in mla decoding by @zhyncs in #793
  • Fix arguments of plan for split QK/VO head dims by @abmfy in #795
  • test: add unittest comparing deepseek prefill fa2 & 3 implementation by @yzh119 in #797
  • bugfix: fix aot build not compatible with cmake command by @tsu-bin in #796
  • Fix the type annotation of q_dtype and kv_dtype on ragged prefill by @nandor in #798
  • feat: support f32 attention output in FA2 template by @yzh119 in #799
  • feat: apply sm_scale at logits instead of q in FA2 template by @yzh119 in #801 (see the note after this list)
  • bugfix: mla decode failed under cuda graph mode, and update test case by @tsu-bin in #803
  • perf: memory efficient deepseek mla fused page-attention kernel by @yzh119 in #804
  • bugfix: mla page-attention kernel for different page sizes by @yzh119 in #810
  • doc: add documentation to new MLA interface by @yzh119 in #811
  • feat: unlocking MLA for A100 by @yzh119 in #812
  • feat: cudagraph-compatible MLA API by @yzh119 in #813
  • feat: unlock MLA attention for sm89 (L40/L40s/4090) by @yzh119 in #814
  • misc: fix sphinx by @abcdabcd987 in #815
  • bugfix: fix the behavior of mla plan function when provided with host tensors by @yzh119 in #816
  • doc: improve mla related documentation by @yzh119 in #818
  • release: bump version to v0.2.1 by @yzh119 in #819
  • refactor: change to TORCH_LIBRARY by @youkaichao in #764
  • Revert "refactor: change to TORCH_LIBRARY" by @yzh119 in #820
  • bugfix: bugfix on sm89 MLA by @yzh119 in #821
  • hotfix: bugfix on #812 by @yzh119 in #822
  • refactor: change to TORCH_LIBRARY by @abmfy in #823
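
Note on the two sm_scale changes above: both follow from standard softmax algebra. A minimal sketch, assuming (as in typical FlashAttention-style kernels) that the exponential is evaluated in base 2, which is why the scale must pick up a factor of $\log_2 e$ (#787), and that the softmax scale is a scalar, which is why it can be applied to the logits rather than to $q$ (#801):

$$
\operatorname{softmax}\!\big((s\,q)\,K^{\top}\big) \;=\; \operatorname{softmax}\!\big(s\,(q\,K^{\top})\big),
\qquad
e^{\,s\,x} \;=\; 2^{\,(s\,\log_2 e)\,x}.
$$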

New Contributors

Full Changelog: v0.2.0.post2...v0.2.1