diff --git a/_posts/2024-01-03-introduce-flashinfer.md b/_posts/2024-01-03-introduce-flashinfer.md
index 9aece28..136a55b 100644
--- a/_posts/2024-01-03-introduce-flashinfer.md
+++ b/_posts/2024-01-03-introduce-flashinfer.md
@@ -186,7 +186,7 @@ Figure 10: Fused RoPE attention performance, use Llama2-7B setting: um_kv_heads=

 RoPE has negligible overhead on all 4 GPUs, especially for RTX 6000 Ada and RTX 4090 GPU which has
-strong CUDA Cores performance (RoPE requires `sin`/`cos` computation that can only be accelerated with Tensor Cores).
+strong CUDA Cores performance (RoPE requires `sin`/`cos` computation that can not be accelerated with Tensor Cores).

 ### Low-Precision Attention

diff --git a/_posts/2024-01-08-cascade-inference.md b/_posts/2024-01-08-cascade-inference.md
index aa63c2e..2d156f0 100644
--- a/_posts/2024-01-08-cascade-inference.md
+++ b/_posts/2024-01-08-cascade-inference.md
@@ -3,7 +3,7 @@ layout: post
 title: "Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding"
 date: 2024-02-02
 comments: true
-author: Zihao Ye (UW), Ruihang Lai (CMU), Bo-Ru Lu (UW), Chien-Yu Lin (UW), Size Zheng (UW & PKU), Lequn Chen (UW), Tianqi Chen (CMU & OctoML), Luis Ceze (UW & OctoML)
+author: Zihao Ye (UW), Ruihang Lai (CMU), Bo-Ru Lu (UW), Chien-Yu Lin (UW), Size Zheng (UW & PKU), Lequn Chen (UW), Tianqi Chen (CMU & OctoAI), Luis Ceze (UW & OctoAI)
 redirect_from: "/2024/01/08/cascade-inference"
 ---
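
For context on the sentence corrected in the first hunk: RoPE rotates each pair of query/key channels by a position-dependent angle, so a fused attention kernel has to evaluate `sin`/`cos` element-wise (CUDA-core work) before the Tensor-Core matmuls. The following is a minimal NumPy sketch of that rotation, assuming the half-split (GPT-NeoX/Llama-style) channel layout; the `apply_rope` helper is illustrative only and is not FlashInfer's API.

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate channel pairs of x by position-dependent angles (RoPE, half-split layout).

    x: [seq_len, head_dim] query or key vectors; head_dim must be even.
    positions: [seq_len] token positions.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # theta_i = base^(-2i/d): one rotation frequency per channel pair
    freqs = base ** (-np.arange(half) * 2.0 / head_dim)        # [half]
    angles = positions[:, None] * freqs[None, :]               # [seq_len, half]
    # element-wise sin/cos: this is the part done on CUDA cores, not Tensor Cores
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                          # split into rotated pairs
    # 2-D rotation of each (x1, x2) pair by its angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# toy usage: rotate 8 query vectors with head_dim=128 (Llama2-7B head size)
q = np.random.randn(8, 128).astype(np.float32)
q_rot = apply_rope(q, np.arange(8))
```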