IST-DASLab/EvoPress

EvoPress

Code for the paper "EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search".

Usage

Repository structure


  • scripts/ — contains bash scripts with the required arguments to run the method
  • src/ — directory for helper methods and utility functions
  • evo_drop_search.py — evolutionary depth pruning
  • drop_scoring.py — scoring-based baseline methods for depth pruning
  • brute_force_drop.py — brute force depth pruning
  • evo_prune_search.py — evolutionary unstructured sparsity allocation
  • prune.py — SparseGPT unstructured pruning (preparation of database for EvoPress)
  • owl_prune.py — SparseGPT unstructured pruning (preparation of database for OWL)
  • evo_quant_search.py — evolutionary quantization bitwidth allocation
  • quant.py — GPTQ quantization (preparation of database for EvoPress)
  • compute_layer_errors.py — compute NMSE for Dynamic Programming (DP) solver
  • dp_search.py — runs the DP solver on the configuration produced by compute_layer_errors.py
  • lmeval.py — LM Eval Harness evaluation script
  • eval_ppl.py — perplexity evaluation script

Calibration data


We provide three options for calibration data: wikitext2, c4, and fineweb_edu. We recommend fineweb_edu for the best results. In our experiments we used 8M tokens for calibration. To prepare a specific amount of calibration data, set --calibration_tokens. By default, calibration sequences are trimmed to the model's maximum context length. For some models, however, that length is too long to fit into memory, so we set --calibration_sequence_length to 8k for models with context length >= 8k.

In our experiments we used --calibration_tokens=2^23 (that is, 8,388,608 tokens) and --calibration_sequence_length=8192 for Llama-3-8B, Llama-3.1-8B, and Phi-3-medium-128k-instruct, and --calibration_sequence_length=4096 for Llama-2-7b.
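
As a quick sanity check on these settings, the sketch below works out how many calibration sequences the token budget corresponds to. It assumes the budget is split into equal-length, full sequences. The helper name num_calibration_sequences is hypothetical and not part of this repository, and the scripts may split the data differently.

```python
def num_calibration_sequences(calibration_tokens: int, sequence_length: int) -> int:
    """Number of full sequences that fit in the token budget.

    Assumption: the budget is cut into equal-length sequences and any
    remainder shorter than sequence_length is discarded.
    """
    if calibration_tokens <= 0 or sequence_length <= 0:
        raise ValueError("both arguments must be positive")
    return calibration_tokens // sequence_length


CALIBRATION_TOKENS = 2 ** 23  # 8,388,608 tokens, the "8M" budget quoted above

for seq_len in (8192, 4096):
    n = num_calibration_sequences(CALIBRATION_TOKENS, seq_len)
    print(f"sequence_length={seq_len}: {n} sequences")
```

Under this assumption, the 8192-token setting gives 1024 sequences and the 4096-token setting gives 2048.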

Multi-GPU


Some of the scripts (Unstructured Sparsity, Quantization) may operate in distributed mode for faster execution. We recommend using torchrun to launch them:

torchrun --nnodes=1 --nproc-per-node=<NUM_GPU> <name_of_the_script.py> <args...>
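
If the launch needs to be driven from Python (for example from a sweep script), the command above can be assembled as an argument list. This is only a sketch: build_torchrun_command is a hypothetical helper, prune.py is used purely as an example script name, and the script's own arguments are left empty because they are defined in the bash scripts under scripts/.

```python
import shlex


def build_torchrun_command(script, num_gpus, script_args=()):
    """Assemble the single-node torchrun command line as a list of arguments."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun",
        "--nnodes=1",
        f"--nproc-per-node={num_gpus}",
        script,
        *script_args,  # script-specific arguments, not filled in here
    ]


cmd = build_torchrun_command("prune.py", num_gpus=4)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to actually start the job.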

Depth pruning


We provide 3 versions for depth pruning:

  • evo_drop_search.py — depth pruning via EvoPress
  • drop_scoring.py — depth pruning via scoring methods
  • brute_force_drop.py — depth pruning via brute force

To run EvoPress for depth pruning, execute run_drop_search.sh in the scripts folder.

Unstructured Sparsity


We provide 2 versions for unstructured pruning:

  • prune.py — SparseGPT unstructured pruning (preparation of database for EvoPress)
  • owl_prune.py — SparseGPT unstructured pruning (preparation of database for OWL)

To run EvoPress for non-uniform unstructured pruning, first execute run_sparse_gpt.sh to generate the database and then run_prune_search.sh for the search.

Quantization


We provide quant.py for producing the GPTQ database for EvoPress.

To run EvoPress for non-uniform quantization, first execute run_gptq.sh to generate the database and then run_quant_search.sh for the search.

Evaluation


We provide the lmeval.py and eval_ppl.py scripts for evaluation on Language Model Evaluation Harness benchmarks and for perplexity measurements. The interface of lmeval.py mostly follows the instructions of the original Language Model Evaluation Harness. In addition, specify the path to the sparse or quantized weights via the --sparse-weights-path or --quant-weights-path argument, and the path to the .txt file with the chosen compression levels via the --sparse-config-path or --quant-config-path argument. We used lm-eval==0.4.0 for evaluation.

Environment

This code was tested in the following environment:

pytorch                   2.4.0           py3.10_cuda12.1_cudnn9.1.0_0    pytorch
pytorch-cuda              12.1                 ha16c6d3_5    pytorch
cuda                      12.1.0                        0    nvidia/label/cuda-12.1.0
transformers              4.43.4                   pypi_0    pypi
datasets                  2.21.0                   pypi_0    pypi
lm-eval                   0.4.0                    pypi_0    pypi

Notes

The scripts prune.py, owl_prune.py, and quant.py produce several compressed versions of each weight (100-200 GB). Make sure you have enough free disk space before running them. Additionally, when using KL-Divergence as the fitness function for the search, ensure you have enough RAM to store the logits, particularly for models with a 128K vocabulary. Alternatively, we implemented TopK-KL-Divergence in evo_quant_search.py, which significantly reduces memory requirements. Preliminary experiments have shown this method to be comparably effective to KL-Divergence for $K \geq 512$.
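
To illustrate why a top-K variant needs less memory, here is a minimal sketch of one plausible definition. It is an assumption and not necessarily what evo_quant_search.py implements: for each token, keep only the K largest logits of the reference (uncompressed) model, take the compressed model's logits at the same vocabulary indices, renormalize both over those K entries, and compute the KL divergence. The function names are hypothetical, and plain Python lists stand in for tensors.

```python
import heapq
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def topk_kl_divergence(ref_logits, new_logits, k):
    """KL(ref || new) for one token, restricted to the k largest reference logits.

    Assumption: both distributions are renormalized over the selected k
    entries; the repository's variant may differ.
    """
    if not 1 <= k <= len(ref_logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Vocabulary indices of the k largest reference logits.
    top = heapq.nlargest(k, range(len(ref_logits)), key=ref_logits.__getitem__)
    ref_logp = log_softmax([ref_logits[i] for i in top])
    new_logp = log_softmax([new_logits[i] for i in top])
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_logp, new_logp))


ref = [2.0, 1.0, 0.5, -1.0, -3.0]   # toy reference logits for one token
new = [1.5, 1.2, 0.4, -0.5, -3.5]   # toy logits from a compressed model
print(topk_kl_divergence(ref, ref, k=3))  # identical logits give 0.0
print(topk_kl_divergence(ref, new, k=3))
```

Under this scheme only K reference values and K indices have to be kept per token. For K = 512 and a vocabulary of roughly 128K entries, that is about 1,024 numbers per token instead of about 128,000.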