Since the emergence of ChatGPT in 2022, accelerating Large Language Model (LLM) inference has become increasingly important. Here is a list of papers on LLM inference and serving.
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Full Stack Optimization for Transformer Inference: a Survey | Hardware and software co-design | UCB | Arxiv | |
A survey of techniques for optimizing transformer inference | Transformer optimization | Iowa State University | Journal of Systems Architecture | |
A Survey on Model Compression for Large Language Models | Model Compression | UCSD | Arxiv | |
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | Optimization techniques: quantization, pruning, continuous batching, virtual memory | CMU | Arxiv | |
LLM Inference Unveiled: Survey and Roofline Model Insights | Performance analysis | Infinigence-AI | Arxiv | LLMViewer |
LLM Inference Serving: Survey of Recent Advances and Opportunities | | Northeastern University | Arxiv | |
Efficient Large Language Models: A Survey | | The Ohio State University | Transactions on Machine Learning Research | |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
AIOS: LLM Agent Operating System | OS; LLM Agent | Rutgers University | Arxiv | |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models | Overlap | | ASPLOS 2023 | |
Efficiently scaling Transformer inference | Scaling | | MLSys 2023 | |
Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning | Communication partition | PKU | ASPLOS 2024 | |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training | | University of Michigan | NSDI 2023 | Github repo |
Power-aware Deep Learning Model Serving with μ-Serve | | UIUC | ATC 2024 | |
Characterizing Power Management Opportunities for LLMs in the Cloud | LLM | Microsoft Azure | ASPLOS 2024 | |
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency | LLM Serving Cluster | UIUC | Arxiv | |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs | Consumer-grade GPU | HKBU | Arxiv | |
Petals: Collaborative Inference and Fine-tuning of Large Models | | Yandex | Arxiv | |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models | cold boot | The University of Edinburgh | OSDI 2024 | Empty Github |
StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow | | HUST | ATC 2024 | Github |
Paper | Keywords | Institute (first) | Publication | Others |
---|---|---|---|---|
Characterization of Large Language Model Development in the Datacenter | Cluster trace (for LLM) | Shanghai AI Lab | NSDI 2024 | Github |
BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems | GPT user trace | HKUST (GZ) | Arxiv | Github |
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | Disaggregated trace | Moonshot AI | | Github |
Splitwise: Efficient generative LLM inference using phase splitting | Disaggregated trace | UW and Microsoft | ISCA 2024 | Github Trace |