# Awesome_LLM_System-PaperList

Since the emergence of ChatGPT in 2022, accelerating Large Language Model inference and serving has become increasingly important. Here is a list of papers on LLM inference and serving.

## Survey

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Full Stack Optimization for Transformer Inference: a Survey | Hardware and software co-design | UCB | Arxiv | |
| A survey of techniques for optimizing transformer inference | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
| A Survey on Model Compression for Large Language Models | Model compression | UCSD | Arxiv | |
| Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | Optimization techniques: quantization, pruning, continuous batching, virtual memory | CMU | Arxiv | |
| LLM Inference Unveiled: Survey and Roofline Model Insights | Performance analysis | Infinigence-AI | Arxiv | LLMViewer |
| LLM Inference Serving: Survey of Recent Advances and Opportunities | | Northeastern University | Arxiv | |
| Efficient Large Language Models: A Survey | | The Ohio State University | Transactions on Machine Learning Research | |

## Framework

| Paper/OpenSource Project | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale | DeepSpeed; Kernel fusion | Microsoft | SC 2022 | Github repo |
| DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference | DeepSpeed; Split fuse | Microsoft | Arxiv | Github repo |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | vLLM; PagedAttention | UCB | SOSP 2023 | Github repo |
| TensorRT-LLM/FastTransformer | | NVIDIA | | |
| lightLLM | | Shanghai Artificial Intelligence Laboratory | | |
| MLC LLM | TVM; Multi-platform | MLC-Team | | |
| Text-Generation-Inference (TGI) | | Huggingface | | |
| NanoFlow: Towards Optimal Large Language Model Serving Throughput | Distributed, parallel, and cluster computing | University of Washington | Arxiv | Github repo |
| rtp-llm | | Alibaba | | Github repo |
| Efficiently Programming Large Language Models using SGLang | Agent language | UCB | Arxiv | Github repo |
| HybridFlow: A Flexible and Efficient RLHF Framework | RLHF training | ByteDance | Eurosys 2024 | Github repo |
| ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | RLHF training | THU | Arxiv | Github repo |
| Enabling Parallelism Hot Switching for Efficient Training of Large Language Models | | Peking University | SOSP 2024 | |
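Several frameworks above center on KV-cache paging, notably vLLM's PagedAttention. The core idea can be sketched in a few lines: KV entries live in fixed-size physical blocks, and each sequence keeps a block table mapping its logical positions to blocks, so sequences grow without large contiguous allocations. This is a toy illustration of the concept, not vLLM's actual implementation; all names here are made up for the sketch.

```python
class PagedKVCache:
    """Toy sketch of PagedAttention-style KV-cache paging (illustrative only,
    not vLLM's implementation). KV entries are stored in fixed-size physical
    blocks; each sequence's block table maps logical positions to blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens cached so far

    def append(self, seq_id, kv_entry):
        """Reserve a slot for one token's KV entry; returns (block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:         # last block full (or none yet)
            table.append(self.free_blocks.pop())  # allocate one more block
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        # A finished sequence returns all of its blocks to the pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because allocation happens one block at a time, internal fragmentation is bounded by one block per sequence, which is the main memory win PagedAttention reports over contiguous per-sequence buffers.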

## Serving

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Fast Distributed Inference Serving for Large Language Models | Distributed inference serving | PKU | Arxiv | |
| AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving | Pipeline parallelism; Auto parallelism | UCB | OSDI 2023 | Github repo |
| Orca: A Distributed Serving System for Transformer-Based Generative Models | Continuous batching | Seoul National University | OSDI 2022 | |
| Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads | Multiple decoding heads | Princeton University | Arxiv | Github repo |
| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | Consumer-grade GPU | SJTU | Arxiv | Github repo |
| LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Flash; Pruning | Apple | Arxiv | |
| Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline | Length perception | NUS | NeurIPS 2023 | Github repo |
| S3: Increasing GPU Utilization during Generative Inference for Higher Throughput | | Harvard University | Arxiv | |
| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Decoupling | PKU | OSDI 2024 | |
| Splitwise: Efficient generative LLM inference using phase splitting | Decoupling | UW | ISCA 2024 | Track issue |
| FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU | Single GPU | Stanford University | Arxiv | Github repo |
| Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | Decoupling | GaTech | OSDI 2024 | |
| SpotServe: Serving Generative Large Language Models on Preemptible Instances | Preemptible GPU | CMU | ASPLOS 2024 | Empty Github repo |
| SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification | Tree-based speculation | CMU | ASPLOS 2024 | |
| AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | Cache multi-turn prefill KV-cache in host DRAM and SSD | NUS | ATC 2024 | |
| MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving | Spatial-temporal multiplexing for serving multiple LLMs | MMLab | Arxiv | |
| PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference | KV cache compression | Shanghai Jiao Tong University | Arxiv | |
| You Only Cache Once: Decoder-Decoder Architectures for Language Models | KV cache | Microsoft Research | Arxiv | |
| Better & Faster Large Language Models via Multi-token Prediction | Multi-token prediction | Meta | Arxiv | |
| ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference | Decoupling | Hanyang University | ASPLOS 2024 | |
| Parrot: Efficient Serving of LLM-based Applications with Semantic Variable | LLM applications | SJTU | OSDI 2024 | |
| Fairness in Serving Large Language Models | Fairness; LLM serving | UC Berkeley, Stanford University | OSDI 2024 | |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | KV cache | Moonshot AI | | Github |
| MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention | Pre-filling for long context; Dynamic sparse attention | Microsoft | Arxiv | Github repo |
| MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Memory pool | Huawei | Arxiv | |
| InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Sparsity | Seoul National University | OSDI 2024 | |
| Llumnix: Dynamic Scheduling for Large Language Model Serving | Preemptible GPU | Alibaba Group | OSDI 2024 | |
| PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch | Multi-agent | Tsinghua University | ATC 2024 | |
| SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention | Sparsity; Long context | PKU | Arxiv | |
| Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference | Sparsity; Relevant tokens | MIT | ICML 2024 | |
| Accelerating Production LLMs with Combined Token/Embedding Speculators | Speculative decoding | IBM Research | Arxiv | Github repo |
| LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference | KV cache | Apple | Arxiv | |
| Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU | Attention saddles; KV cache | Shanghai Jiao Tong University | Arxiv | |
| TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text | KV cache for RAG | Moore Threads AI | Arxiv | Github repo |
| Efficient Streaming Language Models with Attention Sinks | StreamingLLM; Static sparsity | MIT | ICLR 2024 | Github repo |
| H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | Sparse attention | UT Austin | NeurIPS 2023 | |
| SparQ Attention: Bandwidth-Efficient LLM Inference | Sparse attention | GraphCore | ICML 2024 | |
| RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Vector retrieval | MSRA | Arxiv | |
| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | Cross-chunk KV cache reuse | University of Chicago | EuroSys | |
| Epic: Efficient Position-Independent Context Caching for Serving Large Language Models | Position-independent caching | PKU | Arxiv | |
| CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | KV cache compression | University of Chicago | SIGCOMM | |
| SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | Separate handling of prefill and decoding KV cache | SEU | Arxiv 2024 | |
| FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Heterogeneous pipelines | THU | Arxiv 2024 | |
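A recurring keyword in the serving papers above is continuous (iteration-level) batching, introduced by Orca: after every decode step, finished sequences leave the batch and queued requests join immediately, instead of waiting for the whole batch to drain. A minimal sketch of that scheduling loop, with all names and the request format invented for illustration:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy sketch of Orca-style iteration-level batching (illustrative only).
    `requests` maps request id -> number of tokens it still needs to generate.
    Returns the batch composition at each decode iteration."""
    waiting = deque(requests)     # arrival queue, in insertion order
    remaining = dict(requests)
    running, schedule = [], []
    while waiting or running:
        # Admit new requests into free batch slots at iteration granularity.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        schedule.append(sorted(running))
        for rid in list(running):       # one decode step for every running seq
            remaining[rid] -= 1
            if remaining[rid] == 0:
                running.remove(rid)     # finished: its slot frees up immediately
    return schedule
```

With static batching, a short request would hold its slot until the longest request in the batch finished; here the slot is recycled on the very next iteration, which is where the throughput gain comes from.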

## Operating System

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| AIOS: LLM Agent Operating System | OS; LLM agent | Rutgers University | Arxiv | |

## Transformer Acceleration

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| TurboTransformers: An Efficient GPU Serving System For Transformer Models | | Tencent | PPoPP 2021 | Github repo |
| FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | FlashAttention; Online softmax | Stanford University | NeurIPS 2022 | Github repo |
| FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | | Stanford University | Arxiv | Github repo |
| FlashDecoding++: Faster Large Language Model Inference on GPUs | Softmax with unified maximum value | Tsinghua University | MLSys 2024 | |
| FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores | FFT; Tensor Core; Long sequences | Stanford University | Arxiv | Github repo |
| FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks | | Georgia Institute of Technology | ASPLOS 2023 | |
| ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs | Variable-length inputs | UCR | PPoPP 2022 | Github repo |
| Fast Transformer Decoding: One Write-Head is All You Need | MQA | Google | Arxiv | |
| GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | GQA | Google Research | ACL 2023 | |
| LightSeq: A High Performance Inference Library for Transformers | | ByteDance | NAACL 2021 | Github repo |
| LightSeq2: Accelerated Training for Transformer-based Models on GPUs | | ByteDance | SC 2022 | |
| Blockwise Parallel Transformer for Large Context Models | Blockwise transformer | UCB | NeurIPS 2023 | Github repo |
| vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | Dynamic memory management | Microsoft Research India | Arxiv | |
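The online-softmax trick listed for FlashAttention is what lets attention be computed blockwise without ever materializing the full score vector: a running maximum and a running normalizer are updated in one pass, rescaling earlier terms whenever a new maximum appears. A minimal scalar sketch of the recurrence (the real kernels do this over tiles of the score matrix):

```python
import math

def online_softmax(scores):
    """One-pass (online) softmax: keeps only a running max m and a running
    normalizer d = sum(exp(x - m)), as in the FlashAttention recurrence."""
    m = float("-inf")  # running maximum
    d = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # Rescale the old normalizer to the new max, then add this term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]
```

The result matches the standard two-pass softmax exactly; the point is that `m` and `d` are O(1) state, so the score stream never needs to be stored, which is the memory saving FlashAttention exploits.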

## Model Compression

### Quantization and Pruning

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | | SJTU | MLSys 2024 | Github repo |
| Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Dynamic compression | NVIDIA | Arxiv | |
| Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs | FP6 | USYD | ATC 2024 | |
| AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | AWQ | MIT | MLSys 2024 | Best Paper |
| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | | University of Sydney | VLDB 2024 | Github repo |
| CLLMs: Consistency Large Language Models | Consistency | Shanghai Jiao Tong University | Arxiv | Github repo |
| GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | | ETH | ICLR 2023 | |
| Optimal Brain Damage (OBD) | Groundbreaking work | AT&T Bell Labs | NIPS 1989 | |
| Optimal Brain Surgeon: Extensions and Performance Comparisons | Groundbreaking work | Stanford | NIPS 1993 | |
| WoodFisher: Efficient Second-Order Approximation for Neural Network Compression | | ETH | NeurIPS 2020 | |
| SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | | MIT | PMLR 2023 | |
| QuIP: 2-Bit Quantization of Large Language Models With Guarantees | | Cornell University | NeurIPS 2023 | |
| QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks | | Cornell University | PMLR 2024 | |
| VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models | VQ | MSRA | EMNLP 2024 | |
| GPTVQ: The Blessing of Dimensionality for LLM Quantization | VQ | Qualcomm AI Research | ICML 2024 | |
| PQCache: Product Quantization-based KVCache for Long Context LLM Inference | PQ | PKU | Arxiv 2024 | |
| RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | ANNS | MSRA | Arxiv 2024 | |
| Transformer-VQ: Linear-Time Transformers via Vector Quantization | VQ | Independent Researcher | ICLR 2024 | |
| KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | KV cache | Rice University | ICML 2024 | |
| QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Algorithm and system co-design | MIT | Arxiv 2024 | |
| QTIP: Quantization with Trellises and Incoherence Processing | VQ | Cornell University | NeurIPS 2024 (spotlight) | |
| PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression | Improve STE | Yandex, HSE | NeurIPS 2024 (oral) | |
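The common baseline that papers such as GPTQ and AWQ improve on is per-channel symmetric round-to-nearest (RTN) quantization: one floating-point scale per weight row, integers in a fixed range. A minimal sketch of that baseline (illustrative only; the listed methods add error compensation, activation-aware scaling, codebooks, etc.):

```python
def quantize_rtn(weights, bits=8):
    """Per-row symmetric round-to-nearest quantization, the simple RTN
    baseline. Each row gets one fp scale; weights become integers in
    [-qmax, qmax]. `weights` is a list of rows of floats."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / qmax or 1.0  # avoid zero scale
        quantized.append([round(w / scale) for w in row])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Recover approximate fp weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

RTN bounds the per-weight error by half a quantization step (`scale / 2`), but it ignores which weights matter for the layer's output; closing that gap is precisely what second-order methods (GPTQ, following OBD/OBS) and activation-aware scaling (AWQ, SmoothQuant) are about.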

## Communication

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models | Overlap | Google | ASPLOS 2023 | |
| Efficiently Scaling Transformer Inference | Scaling | Google | MLSys 2023 | |
| Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partition | Communication partition | PKU | ASPLOS 2024 | |

## Energy

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training | | Yale University | NSDI 2023 | Github repo |
| Power-aware Deep Learning Model Serving with μ-Serve | | UIUC | ATC 2024 | |
| Characterizing Power Management Opportunities for LLMs in the Cloud | LLM | Microsoft Azure | ASPLOS 2024 | |
| DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency | LLM serving cluster | UIUC | Arxiv | |

## Decentralized

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs | Consumer-grade GPU | HKBU | Arxiv | |
| Petals: Collaborative Inference and Fine-tuning of Large Models | | Yandex | Arxiv | |

## Serverless

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models | Cold boot | The University of Edinburgh | OSDI 2024 | Empty Github |
| StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow | | HUST | ATC 2024 | Github |

## Trace

| Paper | Keywords | Institute (first) | Publication | Others |
| --- | --- | --- | --- | --- |
| Characterization of Large Language Model Development in the Datacenter | Cluster trace (for LLM) | Shanghai AI Lab | NSDI 2024 | Github |
| BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems | GPT user traces | HKUST(GZ) | Arxiv 2024 | Github |
| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | Disaggregation trace | Moonshot AI | | Github |
| Splitwise: Efficient generative LLM inference using phase splitting | Disaggregation trace | UW and Microsoft | ISCA 2024 | Github Trace |