Releases: DefTruth/Awesome-LLM-Inference

v2.6.6

25 Nov 03:22
40292d7

What's Changed

  • Add code link to BPT by @DefTruth in #95
  • add vAttention code link by @KevinZeng08 in #96
  • 🔥[SageAttention] SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration (@thu-ml) by @DefTruth in #97 (see the sketch below)
  • 🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4-Bit Attention for Plug-and-Play Inference Acceleration (@thu-ml) by @DefTruth in #98
  • 🔥[Squeezed Attention] Squeezed Attention: Accelerating Long Context Length LLM Inference (@UC Berkeley) by @DefTruth in #99
  • 🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference by @DefTruth in #100
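
The two SageAttention entries above are about low-bit attention. As a rough illustration of the general recipe, here is a minimal sketch, assuming per-tensor INT8 scales for Q and K, with softmax and the PV product left in floating point; the papers themselves use smoothing and finer-grained per-block quantization, so this is not their exact method:

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Symmetric per-tensor quantization to signed 8-bit.
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def int8_attention(q, k, v):
    # q, k, v: [seq, head_dim] float tensors.
    q8, sq = quantize_int8(q)
    k8, sk = quantize_int8(k)
    # Integer QK^T, then dequantize scores with the product of the scales.
    scores = (q8.to(torch.int32) @ k8.to(torch.int32).T).float() * (sq * sk)
    probs = torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1)
    return probs @ v  # PV product stays in floating point

q, k, v = (torch.randn(128, 64) for _ in range(3))
ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
print((int8_attention(q, k, v) - ref).abs().max())  # small quantization error
```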

New Contributors

  • @KevinZeng08 made their first contribution in #96

Full Changelog: v2.6.5...v2.6.6

v2.6.5

18 Nov 02:53
06c76ad

What's Changed

  • Add DP/TP/SP/CP papers with code links by @DefTruth in #92
  • 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models by @DefTruth in #93 (see the sketch below)
  • 🔥🔥[TP: Comm Compression] Communication Compression for Tensor Parallel LLM Inference by @DefTruth in #94
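
For readers new to the BPT entry (#93): the core idea is to process attention block by block so the full [seq, seq] score matrix is never materialized. A minimal query-blockwise sketch, assuming single-head tensors and an illustrative block size; BPT additionally blocks the feedforward and fuses it with attention, which is not shown here:

```python
import torch

def blockwise_attention(q, k, v, block=64):
    # Exact attention over query blocks: peak score memory is
    # [block, seq] instead of [seq, seq].
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[0], block):
        scores = (q[i:i + block] @ k.T) * scale
        out[i:i + block] = torch.softmax(scores, dim=-1) @ v
    return out

q, k, v = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax(q @ k.T * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), ref, atol=1e-5)
```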

Full Changelog: v2.6.4...v2.6.5

v2.6.4

13 Nov 07:02
f3f27a7

What's Changed

  • 🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs by @DefTruth in #91 (see the sketch below)
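
As context for the BitNet a4.8 entry: the regime is ternary weights with 4-bit activations. A fake-quantization sketch, assuming the absmean weight rule from BitNet b1.58 and per-token absmax 4-bit activations; the paper's hybrid quantization/sparsification of outlier channels is not reproduced here:

```python
import torch

def quant_weights_ternary(w):
    # BitNet b1.58-style absmean rule: scale, then round into {-1, 0, 1}.
    scale = w.abs().mean().clamp(min=1e-5)
    return (w / scale).round().clamp(-1, 1) * scale

def quant_act_4bit(x):
    # Per-token absmax quantization to the signed 4-bit range [-8, 7].
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    return (x / scale).round().clamp(-8, 7) * scale

w = torch.randn(256, 256)   # stand-in weight matrix
x = torch.randn(4, 256)     # stand-in activations
y = quant_act_4bit(x) @ quant_weights_ternary(w).T
print(y.shape)  # torch.Size([4, 256])
```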

Full Changelog: v2.6.3...v2.6.4

v2.6.3

01 Nov 01:18
a854d6c

What's Changed

  • 🔥[Fast Best-of-N] Fast Best-of-N Decoding via Speculative Rejection by @DefTruth in #89 (see the sketch below)
  • 🔥[Tensor Product] Acceleration of Tensor-Product Operations with Tensor Cores by @DefTruth in #90
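
For the Fast Best-of-N entry (#89): the idea is to rank candidates early with a reward model and finish generation only for the survivors. A hedged sketch, where `generate` and `reward` are hypothetical stand-ins and the prefix length / keep fraction are illustrative choices, not the paper's schedule:

```python
import random

def best_of_n_speculative(prompt, n, generate, reward,
                          prefix_len=64, full_len=512, keep_frac=0.25):
    # Cheap phase: draft a short continuation for every candidate.
    drafts = [generate(prompt, max_tokens=prefix_len) for _ in range(n)]
    # Rank by reward on the partial text and reject the weak candidates.
    drafts.sort(key=lambda d: reward(prompt + d), reverse=True)
    survivors = drafts[:max(1, int(n * keep_frac))]
    # Expensive phase: finish only the survivors.
    finished = [d + generate(prompt + d, max_tokens=full_len - prefix_len)
                for d in survivors]
    return max(finished, key=lambda d: reward(prompt + d))

gen = lambda p, max_tokens: "".join(random.choice("ab") for _ in range(max_tokens))
rew = lambda s: s.count("a")  # toy reward: prefer 'a'-heavy text
print(rew(best_of_n_speculative("seed: ", 8, gen, rew)))
```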

Full Changelog: v2.6.2...v2.6.3

v2.6.2

28 Oct 02:38
613300d

What's Changed

  • Early exit of LLM inference by @boyi-liu in #85 (see the sketch below)
  • Add paper AdaKV by @FFY0 in #86
  • Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance by @aharshms in #87
  • 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference by @DefTruth in #88
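
As background for the early-exit entry (#85): intermediate layers get prediction heads, and computation stops once a head is confident. A toy sketch, assuming per-layer linear heads and a 0.9 max-probability threshold, both illustrative choices rather than any particular paper's design:

```python
import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, dim=64, vocab=100, n_layers=6, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, h):
        for depth, (layer, head) in enumerate(zip(self.layers, self.heads), 1):
            h = torch.relu(layer(h))
            probs = torch.softmax(head(h), dim=-1)
            if probs.max() >= self.threshold:   # confident enough: exit early
                return probs.argmax(-1), depth
        return probs.argmax(-1), depth          # fell through: full depth

token, depth = EarlyExitStack()(torch.randn(64))
print(f"predicted token {token.item()} after {depth} layers")
```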

New Contributors

  • @boyi-liu made their first contribution in #85
  • @FFY0 made their first contribution in #86
  • @aharshms made their first contribution in #87

Full Changelog: v2.6.1...v2.6.2

v2.6.1

14 Oct 05:08
7ba03a6

What's Changed

  • [From Author] Link CacheGen and CacheBlend to LMCache by @KuntaiDu in #80
  • 🔥[LORC] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy by @DefTruth in #81
  • Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation by @DefTruth in #82
  • [LLM Inference] Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective by @DefTruth in #83
  • 🔥[ParallelSpec] ParallelSpec: Parallel Drafter for Efficient Speculative Decoding by @DefTruth in #84 (see the sketch below)
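
As background for the ParallelSpec entry (#84), here is the standard speculative-decoding loop it builds on: a small drafter proposes k tokens and the target model verifies them, accepting until the first disagreement. `draft_model` and `target_model` are toy callables, and the greedy match rule is a simplification of the usual stochastic acceptance test:

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    # Draft phase: the cheap model proposes k tokens autoregressively.
    ctx, proposed = list(prefix), []
    for _ in range(k):
        tok = draft_model(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: accept drafted tokens until the target disagrees.
    # (A real system scores all k positions in one batched target pass;
    # this toy calls the target per position for clarity.)
    out = list(prefix)
    for tok in proposed:
        target_tok = target_model(out)
        out.append(target_tok)       # the target's token is always kept
        if target_tok != tok:        # mismatch: discard the rest of the draft
            break
    return out

draft = lambda ctx: (len(ctx) * 7) % 10
target = lambda ctx: (len(ctx) * 7) % 10 if len(ctx) % 3 else 0
print(speculative_step([1, 2, 3], draft, target))
```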

New Contributors

  • @KuntaiDu made their first contribution in #80

Full Changelog: v2.6...v2.6.1

v2.6

03 Oct 01:02
c3f1409

What's Changed

  • 🔥[VPTQ] VPTQ: Extreme Low-Bit Vector Post-Training Quantization for Large Language Models by @DefTruth in #70
  • Fix typo by @DefTruth in #71
  • 🔥🔥[INT-FlashAttention] INT-FlashAttention: Enabling Flash Attention for INT8 Quantization by @DefTruth in #72
  • [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms by @DefTruth in #73
  • 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning by @DefTruth in #74
  • 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization by @DefTruth in #75
  • 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores by @DefTruth in #76
  • 🔥[KV-COMPRESS] Paged KV-Cache Compression with Variable Compression Rates per Attention Head by @DefTruth in #77 (see the sketch below)
  • 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management by @DefTruth in #78
  • Bump up to v2.6 by @DefTruth in #79
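
For the KV-COMPRESS entry (#77): the distinguishing feature is a different compression rate per attention head. An illustrative sketch, assuming eviction by recent attention mass and hand-picked per-head budgets; the paper's actual eviction policy and paged memory layout are not reproduced:

```python
import torch

def compress_kv(keys, values, attn_mass, budgets):
    # keys/values: [heads, seq, dim]; attn_mass: [heads, seq] recent attention;
    # budgets: entries to retain per head (variable rate across heads).
    kept_k, kept_v = [], []
    for h, budget in enumerate(budgets):
        idx = attn_mass[h].topk(budget).indices.sort().values  # keep hot entries in order
        kept_k.append(keys[h, idx])
        kept_v.append(values[h, idx])
    return kept_k, kept_v  # ragged across heads, hence lists rather than one tensor

k, v = torch.randn(4, 128, 64), torch.randn(4, 128, 64)
attn_mass = torch.rand(4, 128)
ks, vs = compress_kv(k, v, attn_mass, budgets=[96, 64, 48, 32])
print([t.shape[0] for t in ks])  # [96, 64, 48, 32]
```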

Full Changelog: v2.5...v2.6

v2.5

26 Sep 03:25
3e43647

What's Changed

  • 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference by @DefTruth in #65
  • Update codebase of paper "Parallel Speculative Decoding with Adaptive Draft Length" by @smart-lty in #66
  • Move RetrievalAttention to the long-context section by @DefTruth in #67
  • 🔥🔥[CritiPrefill] CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs by @DefTruth in #68 (see the sketch below)
  • Bump up to v2.5 by @DefTruth in #69
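
For the CritiPrefill entry (#68): prefilling is accelerated by estimating which KV blocks are critical for each query segment and skipping the rest. A loose sketch, assuming one probe query per segment and a fixed top-k of blocks, both stand-ins for the paper's segment-wise criticality estimation:

```python
import torch

def critical_blocks(q_segment, k, block=32, keep=4):
    # Score each KV block with the segment's last query and keep the top-k.
    probe = q_segment[-1]                     # [dim] probe query (assumption)
    k_blocks = k.unflatten(0, (-1, block))    # [n_blocks, block, dim]
    scores = (k_blocks @ probe).amax(dim=-1)  # peak similarity per block
    return scores.topk(min(keep, scores.numel())).indices

q = torch.randn(64, 64)    # one query segment
k = torch.randn(256, 64)   # past keys, 8 blocks of 32
print(critical_blocks(q, k))  # indices of blocks this segment attends to
```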

New Contributors

  • @smart-lty made their first contribution in #66

Full Changelog: v2.4...v2.5

v2.4

18 Sep 05:10
829da5a

What's Changed

  • 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval by @DefTruth in #62
  • 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU by @DefTruth in #63 (see the sketch below)
  • Bump up to v2.4 by @DefTruth in #64
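
As context for the Inf-MLLM entry (#63): streaming inference needs a bounded KV cache. The sketch below shows the generic sink-plus-recent-window eviction policy popularized by StreamingLLM; Inf-MLLM's own policy, based on attention saddles, is more selective, and the sizes here are illustrative:

```python
def evict(cache, n_sink=4, window=1024):
    # cache: list of per-token KV entries, oldest first. Keep the first
    # n_sink "attention sink" tokens plus the most recent window.
    if len(cache) <= n_sink + window:
        return cache
    return cache[:n_sink] + cache[-window:]

cache = list(range(2000))                # stand-in KV entries
cache = evict(cache)
print(len(cache), cache[:5], cache[-1])  # 1028 [0, 1, 2, 3, 976] 1999
```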

Full Changelog: v2.3...v2.4

v2.3

09 Sep 01:25
f0860e8

What's Changed

  • 🔥[CHESS] CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification by @DefTruth in #59 (see the sketch below)
  • 🔥[SpMM] High Performance Unstructured SpMM Computation Using Tensor Cores by @DefTruth in #60
  • Bump up to v2.3 by @DefTruth in #61
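
For the CHESS entry (#59): activations below a per-channel threshold are zeroed so downstream matrix multiplies can skip work. A rough sketch, assuming thresholds taken from a fixed quantile on sample data as a stand-in for the paper's calibration procedure:

```python
import torch

def channelwise_sparsify(x, thresholds):
    # x: [tokens, channels]; thresholds: [channels] per-channel cutoffs.
    return torch.where(x.abs() >= thresholds, x, torch.zeros_like(x))

x = torch.randn(8, 512)
thresholds = x.abs().quantile(0.7, dim=0)  # assumed calibration: 70th percentile
xs = channelwise_sparsify(x, thresholds)
print(f"sparsity: {(xs == 0).float().mean():.0%}")  # roughly 70% zeros
```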

Full Changelog: v2.2...v2.3