
🚀 ICLR 2025 Spotlight: Streamlining Redundant Layers to Compress Large Language Models

📦 Open Source Models

We have released two compressed models on Hugging Face: Llama-2-4.7B and Llama-3.1-5.4B.
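Assuming the released checkpoints load as standard (smaller) Llama models, they can be used directly with the transformers library. A minimal sketch; the repo id below is a placeholder, so substitute the actual id from our Hugging Face page:

from transformers import AutoModelForCausalLM, AutoTokenizer

# NOTE: placeholder repo id -- replace with the actual id of the released model.
model_id = "RUCKBReasoning/Llama-3.1-5.4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))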

📊 Evaluation Results (lm-eval)

Model            arc_c  arc_e  boolq  hellaswag  openbookqa  rte   winogrande  Avg
Llama-3.1-8B     50.4   80.3   81.2   60.2       34.8        67.9  73.0        64.0
Llama-3.1-5.4B   42.1   72.2   78.0   54.3       27.2        62.8  71.0        58.2
Llama-2-7B       43.3   76.4   77.7   57.2       31.4        62.8  69.1        59.7
Llama-2-4.7B     34.0   64.6   74.7   49.8       27.4        61.7  66.4        54.1

Model Specifications:

  • Llama-2-4.7B: uses a single Transformer layer as the lightweight network; trained on 0.06B tokens.
  • Llama-3.1-5.4B: uses two Transformer layers as the lightweight network; trained on 1.3B tokens.
  • Both models were trained with LLM loss.

🤖 Supported LLM Architectures

⚙️ Installation

Our experiments use CUDA 12.1.

pip install -r requirements.txt

✂️ Layer Pruning

Our code uses a Transformer layer as the lightweight network. The lightweight layer inherits the weights of the first pruned layer as its initialization for training, since this gives better results than using an FFN or SwiGLU module as the lightweight network.
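As a rough illustration of this replacement step (not the repo's actual API: the function name, layer indices, and the choice of which block to prune are illustrative, and the cosine-similarity layer selection is not shown), assuming a Llama-style model whose decoder layers live in model.model.layers:

import copy
import torch
from transformers import AutoModelForCausalLM

def replace_block_with_single_layer(model, start, num_pruned):
    """Replace decoder layers [start, start + num_pruned) with a single layer.

    The lightweight layer inherits the weights of the first pruned layer and
    is then trained to mimic the whole pruned block.
    """
    layers = model.model.layers
    lightweight = copy.deepcopy(layers[start])  # inherit the first pruned layer's weights
    kept = list(layers[:start]) + [lightweight] + list(layers[start + num_pruned:])
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    # NOTE: attention layer_idx / KV-cache bookkeeping is omitted in this sketch.
    return model, lightweight

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
model, lightweight = replace_block_with_single_layer(model, start=22, num_pruned=9)  # illustrative indices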

▶️ MSE Loss Training (Single GPU)

To train the lightweight network using MSE loss, execute:

python mseloss_entry.py

This training runs on a single GPU. By default, Llama-3.1-8B is pruned and 8 layers are removed from the model. All pre-trained models and the dataset are downloaded automatically, so no manual downloads are needed; the first run will take some time to fetch them. Please ensure that sufficient memory is available, since all hidden states are kept in memory. If memory is insufficient, you can modify the code to store the hidden states on disk, or train with LLM loss instead.
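Conceptually, the MSE objective fits the lightweight layer so that it maps the hidden states entering the pruned block to the hidden states the original block produced. A minimal sketch, assuming those hidden states have already been collected and that the network takes hidden states directly (a real decoder layer also needs position embeddings and attention masks in its forward call; names and hyperparameters are illustrative):

import torch
import torch.nn.functional as F

def train_lightweight_mse(lightweight_net, h_in, h_out,
                          lr=1e-4, epochs=1, batch_size=8, device="cuda"):
    """Fit lightweight_net so that lightweight_net(h_in) ~= h_out.

    h_in:  hidden states entering the pruned block, shape (N, seq_len, hidden)
    h_out: hidden states produced by the original block, same shape
    """
    lightweight_net = lightweight_net.to(device).train()
    opt = torch.optim.AdamW(lightweight_net.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(h_in, h_out),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.mse_loss(lightweight_net(x.to(device)), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return lightweight_net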

▶️ LLM Loss Training (Multi-GPU)

To train the lightweight network using LLM loss under the Accelerate and DeepSpeed frameworks, execute:

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --config_file 4gpu.yaml llmloss_entry.py

This training runs on 4 GPUs and requires more GPU memory than MSE loss.

We recommend MSE loss when GPU resources are limited; when resources are sufficient, LLM loss yields better results.
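Conceptually, LLM-loss training optimizes the standard next-token cross-entropy of the pruned model while updating only the lightweight layer. A minimal single-device sketch with illustrative names and hyperparameters (the actual script additionally runs under Accelerate and DeepSpeed for multi-GPU training):

import torch

def train_with_llm_loss(model, lightweight_net, dataloader, lr=1e-4, device="cuda"):
    """Update only the lightweight layer with the causal language-modeling loss."""
    model = model.to(device).train()
    for p in model.parameters():            # freeze the original layers
        p.requires_grad = False
    for p in lightweight_net.parameters():  # train only the replacement layer
        p.requires_grad = True
    opt = torch.optim.AdamW(lightweight_net.parameters(), lr=lr)
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        # labels=input_ids makes transformers compute the shifted next-token cross-entropy
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids)
        out.loss.backward()
        opt.step()
        opt.zero_grad()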

🔧 Configuration Parameters

All arguments can be changed in the LLM_Streamline/args.py file.

  • model_name: Path to pretrained model or model identifier from huggingface.co/models
  • layer_intervals: Number of layers to prune.
  • cosine_num_data: Amount of data used to calculate cosine similarity.
  • train_num_data: Amount of data used to train the lightweight model.
  • batch_size: Batch size for training.
  • gradient_accumulation_step: Number of gradient accumulation steps during training. The effective batch size is the product of gradient_accumulation_step and batch_size (see the example after this list).
  • epoches: Number of training epochs.
  • lr: Learning rate for training.
  • min_lr: Minimum learning rate during training.
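For example, with illustrative values:

batch_size = 8                   # sequences per forward/backward pass
gradient_accumulation_step = 4   # batches accumulated before each optimizer step
effective_batch_size = batch_size * gradient_accumulation_step  # 32 sequences per optimizer update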

📐 Stability Calculation

To calculate stability, execute:

python calculate_stability.py arg1 arg2

Here, arg1 refers to the model's evaluation predictions before pruning, and arg2 refers to the predictions after pruning. Both predictions are generated by OpenCompass.

For example:

arg1: "./opencompass/outputs/default/20241121_220629/predictions/llama-3-70b-hf"

arg2: "./opencompass/outputs/default/20241123_220629/predictions/llama-3-70b-hf"
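As a rough illustration only, the sketch below walks two OpenCompass prediction directories and reports the fraction of per-sample predictions that are unchanged after pruning. The assumed file layout (one JSON per dataset mapping sample ids to records with a "prediction" field) may differ across OpenCompass versions, and this is not necessarily the exact stability metric implemented in calculate_stability.py; run the script above for the paper's definition.

import json, os, sys

def load_predictions(pred_dir):
    """Collect {dataset/sample_id: prediction} from OpenCompass JSON outputs."""
    preds = {}
    for name in sorted(os.listdir(pred_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(pred_dir, name)) as f:
            data = json.load(f)
        for sample_id, record in data.items():
            preds[f"{name}/{sample_id}"] = record.get("prediction")
    return preds

if __name__ == "__main__":
    before, after = load_predictions(sys.argv[1]), load_predictions(sys.argv[2])
    shared = before.keys() & after.keys()
    unchanged = sum(before[k] == after[k] for k in shared)
    print(f"unchanged predictions: {unchanged}/{len(shared)} = {unchanged / len(shared):.3f}")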
