Awesome Resource-Efficient LLM Papers

A curated list of high-quality papers on resource-efficient LLMs.

This is the GitHub repo for our survey paper Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models.

LLM Architecture Design

Efficient Transformer Architecture

Date Keywords Paper Venue
2024 Approximate attention Simple linear attention language models balance the recall-throughput tradeoff ArXiv
2024 Hardware attention MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ArXiv
2024 Approximate attention LoMA: Lossless Compressed Memory Attention ArXiv
2024 Approximate attention Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation ICML
2024 Hardware optimization FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning ICLR
2023 Hardware optimization Flashattention: Fast and memory-efficient exact attention with io-awareness NeurIPS
2023 Approximate attention KDEformer: Accelerating Transformers via Kernel Density Estimation ICML
2023 Approximate attention Mega: Moving Average Equipped Gated Attention ICLR
2022 Hardware optimization xFormers - Toolbox to Accelerate Research on Transformers GitHub
2021 Approximate attention Efficient attention: Attention with linear complexities WACV
2021 Approximate attention An Attention Free Transformer ArXiv
2021 Approximate attention Self-attention Does Not Need O(n^2) Memory ArXiv
2021 Hardware optimization LightSeq: A High Performance Inference Library for Transformers NAACL
2021 Hardware optimization FasterTransformer: A Faster Transformer Framework GitHub
2020 Approximate attention Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention ICML
2019 Approximate attention Reformer: The efficient transformer ICLR

Non-transformer Architecture

Date Keywords Paper Venue
2024 Decoder You Only Cache Once: Decoder-Decoder Architectures for Language Models ArXiv
2024 BitLinear layer Scalable MatMul-free Language Modeling ArXiv
2023 RNN LM RWKV: Reinventing RNNs for the Transformer Era EMNLP-Findings
2023 MLP Auto-Regressive Next-Token Predictors are Universal Learners ArXiv
2023 Convolutional LM Hyena Hierarchy: Towards Larger Convolutional Language models ICML
2023 Sub-quadratic Matrices based Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture NeurIPS
2023 Selective State Space Model Mamba: Linear-Time Sequence Modeling with Selective State Spaces ArXiv
2022 Mixture of Experts Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity JMLR
2022 Mixture of Experts GLaM: Efficient Scaling of Language Models with Mixture-of-Experts ICML
2022 Mixture of Experts Mixture-of-Experts with Expert Choice Routing NeurIPS
2022 Mixture of Experts Efficient Large Scale Language Modeling with Mixtures of Experts EMNLP
2017 Mixture of Experts Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer ICLR

LLM Pre-Training

Memory Efficiency

Distributed Training

Date Keywords Paper Venue
2024 Model Parallelism ProTrain: Efficient LLM Training via Adaptive Memory Management Arxiv
2024 Model Parallelism MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs Arxiv
2023 Data Parallelism Palm: Scaling language modeling with pathways Github
2023 Model Parallelism Bpipe: memory-balanced pipeline parallelism for training large language models JMLR
2022 Model Parallelism Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI
2021 Data Parallelism FairScale: A general purpose modular PyTorch library for high performance and large scale training JMLR
2020 Data Parallelism Zero: Memory optimizations toward training trillion parameter models IEEE SC20
2019 Model Parallelism GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism NeurIPS
2019 Model Parallelism Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism Arxiv
2019 Model Parallelism PipeDream: generalized pipeline parallelism for DNN training SOSP
2018 Model Parallelism Mesh-tensorflow: Deep learning for supercomputers NeurIPS

Mixed precision training

Date Keywords Paper Venue
2024 Mixed Precision Training FP8-LM: Training FP8 Large Language Models Arxiv
2022 Mixed Precision Training BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Arxiv
2018 Mixed Precision Training Bert: Pre-training of deep bidirectional transformers for language understanding ACL
2017 Mixed Precision Training Mixed Precision Training ICLR

Data Efficiency

Importance Sampling

Date Keywords Paper Venue
2024 Importance sampling How to Train Data-Efficient LLMs Arxiv
2024 Importance sampling LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning Arxiv
2023 Survey on importance sampling A Survey on Efficient Training of Transformers IJCAI
2023 Importance sampling Data-Juicer: A One-Stop Data Processing System for Large Language Models Arxiv
2023 Importance sampling INGENIOUS: Using Informative Data Subsets for Efficient Pre-Training of Language Models EMNLP
2023 Importance sampling Machine Learning Force Fields with Data Cost Aware Training ICML
2022 Importance sampling Beyond neural scaling laws: beating power law scaling via data pruning NeurIPS
2021 Importance sampling Deep Learning on a Data Diet: Finding Important Examples Early in Training NeurIPS
2018 Importance sampling Training Deep Models Faster with Robust, Approximate Importance Sampling NeurIPS
2018 Importance sampling Not All Samples Are Created Equal: Deep Learning with Importance Sampling ICML

Data Augmentation

Date Keywords Paper Venue
2024 Data Augmentation LLMRec: Large Language Models with Graph Augmentation for Recommendation WSDM
2024 Data augmentation LLM-DA: Data Augmentation via Large Language Models for Few-Shot Named Entity Recognition Arxiv
2023 Data augmentation MixGen: A New Multi-Modal Data Augmentation WACV
2023 Data augmentation Augmentation-Aware Self-Supervision for Data-Efficient GAN Training NeurIPS
2023 Data augmentation Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis EMNLP
2023 Data augmentation FaMeSumm: Investigating and Improving Faithfulness of Medical Summarization EMNLP

Training Objective

Date Keywords Paper Venue
2023 Training objective Challenges and Applications of Large Language Models Arxiv
2023 Training objective Efficient Data Learning for Open Information Extraction with Pre-trained Language Models EMNLP
2023 Masked language-image modeling Scaling Language-Image Pre-training via Masking CVPR
2022 Masked image modeling Masked Autoencoders Are Scalable Vision Learners CVPR
2019 Masked language modeling MASS: Masked Sequence to Sequence Pre-training for Language Generation ICML

LLM Fine-Tuning

Parameter-Efficient Fine-Tuning

Date Keywords Paper Venue
2024 LoRA-based fine-tuning Dlora: Distributed parameter-efficient fine-tuning solution for large language model Arxiv
2024 LoRA-based fine-tuning SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models Arxiv
2024 LoRA-based fine-tuning Data-efficient Fine-tuning for LLM-based Recommendation SIGIR
2024 LoRA-based fine-tuning MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter ACL
2023 LoRA-based fine-tuning DyLoRA: Parameter-Efficient Tuning of Pretrained Models using Dynamic Search-Free Low Rank Adaptation EACL
2022 Masking-based fine-tuning Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively NeurIPS
2021 Masking-based fine-tuning BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models ACL
2021 Masking-based fine-tuning Raise a Child in Large Language Model: Towards Effective and Generalizable Fine-tuning EMNLP
2021 Masking-based fine-tuning Unlearning Bias in Language Models by Partitioning Gradients ACL
2019 Masking-based fine-tuning SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization ACL

Full-Parameter Fine-Tuning

Date Keywords Paper Venue
2024 Full-parameter fine-tuning Hift: A hierarchical full parameter fine-tuning strategy Arxiv
2024 Study of full-parameter fine-tuning optimizations A Study of Optimizations for Fine-tuning Large Language Models Arxiv
2023 Comparative study between full-parameter and LoRA-based fine-tuning A Comparative Study between Full-Parameter and LoRA-based Fine-Tuning on Chinese Instruction Data for Instruction Following Large Language Model Arxiv
2023 Comparative study between full-parameter and parameter-efficient fine-tuning Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification Arxiv
2023 Full-parameter fine-tuning with limited resources Full Parameter Fine-tuning for Large Language Models with Limited Resources Arxiv
2023 Memory-efficient fine-tuning Fine-Tuning Language Models with Just Forward Passes NeurIPS
2023 Full-parameter fine-tuning for medicine applications PMC-LLaMA: Towards Building Open-source Language Models for Medicine Arxiv
2022 Drawback of full-parameter fine-tuning Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution ICLR

LLM Inference

Model Compression

Pruning

Date Keywords Paper Venue
2024 Unstructured Pruning SparseLLM: Towards Global Pruning for Pre-trained Language Models NeurIPS
2024 Structured Pruning Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models Arxiv
2024 Structured Pruning BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation Arxiv
2024 Structured Pruning ShortGPT: Layers in Large Language Models are More Redundant Than You Expect Arxiv
2024 Structured Pruning NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models Arxiv
2024 Structured Pruning SliceGPT: Compress Large Language Models by Deleting Rows and Columns ICLR
2024 Unstructured Pruning Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs ICLR
2024 Structured Pruning Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models ICLR
2023 Unstructured Pruning One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models Arxiv
2023 Unstructured Pruning SparseGPT: Massive Language Models Can be Accurately Pruned in One-Shot ICML
2023 Unstructured Pruning A Simple and Effective Pruning Approach for Large Language Models ICLR
2023 Unstructured Pruning AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference With Transformers TCAD
2023 Structured Pruning LLM-Pruner: On the Structural Pruning of Large Language Models NeurIPS
2023 Structured Pruning LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation ICML
2023 Structured Pruning Structured Pruning for Efficient Generative Pre-trained Language Models ACL
2023 Structured Pruning ZipLM: Inference-Aware Structured Pruning of Language Models NeurIPS
2023 Contextual Pruning Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time ICML

Quantization

Date Keywords Paper Venue
2024 Weight Quantization Evaluating Quantized Large Language Models Arxiv
2024 Weight Quantization I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models Arxiv
2024 Weight Quantization ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models Arxiv
2024 Weight-Activation Co-Quantization Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs NeurIPS
2024 Weight Quantization OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models ICLR
2023 Weight Quantization Flexround: Learnable rounding based on element-wise division for post-training quantization ICML
2023 Weight Quantization Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling EMNLP
2023 Weight Quantization OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models AAAI
2023 Weight Quantization GPTQ: Accurate post-training quantization for generative pre-trained transformers ICLR
2023 Weight Quantization Dynamic Stashing Quantization for Efficient Transformer Training EMNLP
2023 Weight Quantization Quantization-aware and tensor-compressed training of transformers for natural language understanding Interspeech
2023 Weight Quantization QLoRA: Efficient Finetuning of Quantized LLMs NeurIPS
2023 Weight Quantization Stable and low-precision training for large-scale vision-language models NeurIPS
2023 Weight Quantization Prequant: A task-agnostic quantization approach for pre-trained language models ACL
2023 Weight Quantization Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization ISCA
2023 Weight Quantization Awq: Activation-aware weight quantization for llm compression and acceleration arXiv
2023 Weight Quantization Spqr: A sparse-quantized representation for near-lossless llm weight compression arXiv
2023 Weight Quantization SqueezeLLM: Dense-and-Sparse Quantization arXiv
2023 Weight Quantization LLM-QAT: Data-Free Quantization Aware Training for Large Language Models arXiv
2022 Activation Quantization Gact: Activation compressed training for generic network architectures ICML
2022 Fixed-point Quantization Boost Vision Transformer with GPU-Friendly Sparsity and Quantization ACL
2021 Activation Quantization Ac-gc: Lossy activation compression with guaranteed convergence NeurIPS

Dynamic Acceleration

Input Pruning

Date Keywords Paper Venue
2024 Score-based Token Removal Prompt-prompted Adaptive Structured Pruning for Efficient LLM Generation COLM
2024 Score-based Token Removal LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference Arxiv
2024 Learning-based Token Removal LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression ACL
2024 Learning-based Token Removal Compressed Context Memory For Online Language Model Interaction ICLR
2023 Score-based Token Removal Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference KDD
2023 Learning-based Token Removal PuMer: Pruning and Merging Tokens for Efficient Vision Language Models ACL
2023 Learning-based Token Removal Infor-Coef: Information Bottleneck-based Dynamic Token Downsampling for Compact and Efficient language model arXiv
2023 Learning-based Token Removal SmartTrim: Adaptive Tokens and Parameters Pruning for Efficient Vision-Language Models arXiv
2022 Learning-based Token Removal Transkimmer: Transformer Learns to Layer-wise Skim ACL
2022 Score-based Token Removal Learned Token Pruning for Transformers KDD
2021 Learning-based Token Removal TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference NAACL
2021 Score-based Token Removal Efficient sparse attention architecture with cascade token and head pruning HPCA

System Design

Deployment optimization

Date Keywords Paper Venue
2024 Hardware optimization LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System ArXiv
2024 Hardware Optimization LUT TENSOR CORE: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration Arxiv
2023 Hardware offloading FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU PMLR
2023 Hardware offloading Fast distributed inference serving for large language models arXiv
2022 Collaborative inference Petals: Collaborative Inference and Fine-tuning of Large Models arXiv
2022 Hardware offloading DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale IEEE SC22

Support Infrastructure

Date Keywords Paper Venue
2024 Edge devices MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ICML
2024 Edge devices EdgeShard: Efficient LLM Inference via Collaborative Edge Computing Arxiv
2024 Edge devices Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ICML
2024 Edge devices The breakthrough memory solutions for improved performance on llm inference IEEE Micro
2024 Edge devices MELTing point: Mobile Evaluation of Language Transformers MobiCom
2024 Edge devices LLM as a System Service on Mobile Devices Arxiv
2024 Edge devices LocMoE: A Low-overhead MoE for Large Language Model Training Arxiv
2024 Edge devices Jetmoe: Reaching llama2 performance with 0.1 m dollars Arxiv
2023 Edge devices Training Large-Vocabulary Neural Language Models by Private Federated Learning for Resource-Constrained Devices ICASSP
2023 Edge devices Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly arXiv
2023 Libraries Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training ICPP
2023 Libraries GPT-NeoX-20B: An Open-Source Autoregressive Language Model ACL
2023 Edge devices Large Language Models Empowered Autonomous Edge AI for Connected Intelligence arXiv
2022 Libraries DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale IEEE SC22
2022 Libraries Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI
2022 Edge devices EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation arXiv
2022 Edge devices ProFormer: Towards On-Device LSH Projection-Based Transformers ACL
2021 Edge devices Generate More Features with Cheap Operations for BERT ACL
2021 Edge devices SqueezeBERT: What can computer vision teach NLP about efficient neural networks? SustaiNLP
2020 Edge devices Lite Transformer with Long-Short Range Attention arXiv
2019 Libraries Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism IEEE SC22
2018 Libraries Mesh-TensorFlow: Deep Learning for Supercomputers NeurIPS

Other Systems

Date Keywords Paper Venue
2023 Other Systems Tabi: An Efficient Multi-Level Inference System for Large Language Models EuroSys
2023 Other Systems Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation PACMMOD

Resource-Efficiency Evaluation Metrics & Benchmarks

🧮 Computation Metrics

Metric Description Example Usage
FLOPs (Floating-point operations) the number of arithmetic operations on floating-point numbers [FLOPs]
Training Time the total duration required for training, typically measured in wall-clock minutes, hours, or days [minutes, days], [hours]
Inference Time/Latency the average time required to generate an output after receiving an input, typically measured in wall-clock time or CPU/GPU/TPU clock time in milliseconds or seconds [end-to-end latency in seconds], [next token generation latency in milliseconds]
Throughput the rate of output token generation or task completion, typically measured in tokens per second (TPS) or queries per second (QPS) [tokens/s], [queries/s]
Speed-Up Ratio the improvement in inference speed compared to a baseline model [inference time speed-up], [throughput speed-up]
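
As a rough illustration of how the latency, throughput, and speed-up metrics above relate to one another, here is a minimal Python sketch. The function names (`measure_latency`, `throughput`, `speed_up_ratio`) and the injectable `clock` argument are our own assumptions for this example and do not come from any of the listed papers.

```python
import time
from statistics import mean


def measure_latency(fn, inputs, clock=time.perf_counter):
    """Mean wall-clock latency in seconds of calling fn once per input.

    `clock` is injectable (an assumption of this sketch) so the function
    can be exercised with a deterministic fake timer.
    """
    durations = []
    for x in inputs:
        start = clock()
        fn(x)
        durations.append(clock() - start)
    return mean(durations)


def throughput(num_items, elapsed_seconds):
    """Items (tokens or queries) processed per second."""
    return num_items / elapsed_seconds


def speed_up_ratio(baseline_time, optimized_time):
    """How many times faster the optimized model is than the baseline."""
    return baseline_time / optimized_time


if __name__ == "__main__":
    # 1000 generated tokens in 4 seconds.
    print(throughput(1000, 4.0))     # 250.0 tokens/s
    # Baseline takes 2.0 s per request, optimized model takes 0.5 s.
    print(speed_up_ratio(2.0, 0.5))  # 4.0
```

In practice, papers differ on whether latency includes tokenization, warm-up runs, or batching, so any reported number should state those conditions.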

💾 Memory Metrics

Metric Description Example Usage
Number of Parameters the number of adjustable variables in the LLM’s neural network [number of parameters]
Model Size the storage space required for storing the entire model [peak memory usage in GB]
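
The two memory metrics above are closely linked: a back-of-the-envelope model size is the parameter count multiplied by the bytes used per parameter. The sketch below is only an approximation under that assumption. The `BYTES_PER_PARAM` table and both function names are hypothetical, and the estimate covers stored weights only, not activations, optimizer state, or KV cache, so it will understate peak memory usage.

```python
# Bytes per stored parameter for common numeric formats (illustrative table).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def count_parameters(layer_shapes):
    """Total number of parameters given an iterable of tensor shapes."""
    total = 0
    for shape in layer_shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total


def model_size_gb(num_params, dtype="fp16"):
    """Approximate weight storage in GB (10^9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # A toy linear layer: 4x3 weight matrix plus a bias of length 3.
    print(count_parameters([(4, 3), (3,)]))       # 15
    # A 7B-parameter model stored in fp16 versus int4.
    print(model_size_gb(7_000_000_000, "fp16"))   # 14.0
    print(model_size_gb(7_000_000_000, "int4"))   # 3.5
```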

⚡️ Energy Metrics

Metric Description Example Usage
Energy Consumption the electrical energy consumed during the LLM’s lifecycle [kWh]
Carbon Emission the greenhouse gas emissions associated with the model’s energy usage [kgCO2eq]
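
To make the units in the table above concrete, the following sketch converts an average power draw into kWh and then into kgCO2eq. It is a simplified estimate rather than a measurement method: the data-center overhead factor `pue` and the `grid_intensity_kg_per_kwh` input are assumptions of this example, and real values depend on the hardware and the local electricity grid.

```python
def energy_kwh(avg_power_watts, hours, pue=1.0):
    """Energy in kWh from average power draw (W) and runtime (h).

    `pue` (power usage effectiveness) scales for data-center overhead;
    1.0 means no overhead and is an assumed default.
    """
    return avg_power_watts * hours / 1000.0 * pue


def carbon_kgco2eq(energy_kwh_used, grid_intensity_kg_per_kwh):
    """Emissions in kgCO2eq given energy (kWh) and grid carbon intensity."""
    return energy_kwh_used * grid_intensity_kg_per_kwh


if __name__ == "__main__":
    # One accelerator drawing 400 W on average for 10 hours.
    used = energy_kwh(400, 10)
    print(used)                       # 4.0 kWh
    # Hypothetical grid intensity of 0.5 kgCO2eq per kWh.
    print(carbon_kgco2eq(used, 0.5))  # 2.0 kgCO2eq
```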

The following are available software packages designed for real-time tracking of energy consumption and carbon emission.

You might also find the following helpful for predicting the energy usage and carbon footprint before actual training or

💵 Financial Cost Metric

Metric Description Example Usage
Dollars per parameter the total cost of training (or running) the LLM divided by the number of parameters
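
A minimal sketch of the financial metric above, assuming the total cost is already known in dollars. The helper `gpu_cost` and its hourly-price input are hypothetical conveniences for this example, not part of the metric's definition.

```python
def gpu_cost(gpu_hours, price_per_gpu_hour):
    """Total compute cost in dollars (hypothetical helper)."""
    return gpu_hours * price_per_gpu_hour


def dollars_per_parameter(total_cost_dollars, num_params):
    """Total training (or serving) cost divided by parameter count."""
    return total_cost_dollars / num_params


if __name__ == "__main__":
    # 50,000 GPU-hours at an assumed 2 dollars per GPU-hour.
    cost = gpu_cost(50_000, 2.0)
    print(cost)                                        # 100000.0
    # Spread over an 8B-parameter model.
    print(dollars_per_parameter(cost, 8_000_000_000))  # 1.25e-05
```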

📨 Network Communication Metric

Metric Description Example Usage
Communication Volume the total amount of data transmitted across the network during a specific LLM execution or training run [communication volume in TB]

💡 Other Metrics

Metric Description Example Usage
Compression Ratio the reduction in size of the compressed model compared to the original model [compress rate], [percentage of weights remaining]
Loyalty/Fidelity the resemblance between the teacher and student models in terms of both prediction consistency and alignment of predicted probability distributions [loyalty], [fidelity]
Robustness the resistance to adversarial attacks, where slight input modifications can potentially manipulate the model's output [after-attack accuracy, query number]
Pareto Optimality the optimal trade-offs between various competing factors [Pareto frontier (cost and accuracy)], [Pareto frontier (performance and FLOPs)]
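
Two of the metrics above lend themselves to a short worked example. The sketch below computes a compression ratio and extracts a Pareto frontier from (cost, accuracy) pairs. Note the assumptions: we define the ratio as original size over compressed size, which is one common convention but not the only one, and we treat lower cost and higher accuracy as better. All function names are our own.

```python
def compression_ratio(original_size, compressed_size):
    """Original size divided by compressed size (assumed convention)."""
    return original_size / compressed_size


def weights_remaining(original_params, remaining_params):
    """Fraction of weights kept after pruning."""
    return remaining_params / original_params


def pareto_frontier(points):
    """Return the (cost, accuracy) points not dominated by any other.

    A point is dominated if another point has cost no higher and
    accuracy no lower, with at least one of the two strictly better.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by ascending cost; break ties by descending accuracy.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


if __name__ == "__main__":
    print(compression_ratio(14.0, 3.5))  # 4.0
    print(weights_remaining(1000, 250))  # 0.25
    models = [(1, 0.6), (2, 0.7), (2, 0.65), (3, 0.68), (4, 0.9)]
    print(pareto_frontier(models))       # [(1, 0.6), (2, 0.7), (4, 0.9)]
```

The same routine works for a performance-versus-FLOPs frontier by passing FLOPs as the cost coordinate.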

Benchmarks

Benchmark Description Paper
General NLP Benchmarks an extensive collection of general NLP benchmarks such as GLUE, SuperGLUE, WMT, and SQuAD A Comprehensive Overview of Large Language Models
Dynaboard an open-source platform for evaluating NLP models in the cloud, offering real-time interaction and a holistic assessment of model quality with customizable Dynascore Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking
EfficientQA an open-domain Question Answering (QA) challenge at NeurIPS 2020 that focuses on building accurate, memory-efficient QA systems NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned
SustaiNLP 2020 Shared Task a challenge for development of energy-efficient NLP models by assessing their performance across eight NLU tasks using SuperGLUE metrics and evaluating their energy consumption during inference Overview of the SustaiNLP 2020 Shared Task
ELUE (Efficient Language Understanding Evaluation) a benchmark platform for evaluating NLP model efficiency across various tasks, offering online metrics and requiring only a Python model definition file for submission Towards Efficient NLP: A Standard Evaluation and A Strong Baseline
VLUE (Vision-Language Understanding Evaluation) a comprehensive benchmark for assessing vision-language models across multiple tasks, offering an online platform for evaluation and comparison VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Long Range Arena (LRA) a benchmark suite evaluating efficient Transformer models on long-context tasks, spanning diverse modalities and reasoning types while allowing evaluations under controlled resource constraints, highlighting real-world efficiency Long Range Arena: A Benchmark for Efficient Transformers
Efficiency-aware MS MARCO an enhanced MS MARCO information retrieval benchmark that integrates efficiency metrics like per-query latency and cost alongside accuracy, facilitating a comprehensive evaluation of IR systems Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

Reference

If you find this paper list useful in your research, please consider citing:

@article{bai2024beyond,
  title={Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models},
  author={Bai, Guangji and Chai, Zheng and Ling, Chen and Wang, Shiyu and Lu, Jiaying and Zhang, Nan and Shi, Tingwei and Yu, Ziyang and Zhu, Mengdan and Zhang, Yifei and others},
  journal={arXiv preprint arXiv:2401.00625},
  year={2024}
}