
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers (PDF)

Abstract

The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. Effective sparsification, however, remains challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable method for sparsification, applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss (in the majority of cases <2.25%). We show a 2.2x inference speedup for Llama 3.2 on 16 GPUs, and up to a 4.45x reduction in inference memory footprint, resulting in a 2.9x reduction in GPU setup and operating costs.

This repository contains the code for the BSpMM kernel from the paper "BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers".
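To make the target layout concrete, the following minimal sketch (PyTorch, illustrative only; the function name, block size, and keep ratio are assumptions, not the repository's kernel or API) prunes whole tiles of a dense MLP weight and then multiplies with the masked matrix. An actual BSpMM kernel stores and multiplies only the kept blocks instead of materializing the zeros.

# Minimal block-sparsity sketch (illustrative only, not the repository's kernel).
import torch

def block_sparse_mask(weight: torch.Tensor, block: int, keep_ratio: float) -> torch.Tensor:
    """Zero out whole (block x block) tiles of `weight`, keeping the tiles with the largest L1 norm."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of tiles and score each tile by its L1 norm.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                  # shape: (rows//block, cols//block)
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    tile_mask = (scores >= threshold).to(weight.dtype)    # 1 = keep tile, 0 = prune tile
    # Broadcast the tile-level mask back to element granularity.
    mask = tile_mask[:, None, :, None].expand_as(tiles).reshape(rows, cols)
    return weight * mask

x = torch.randn(8, 256)                                   # activations
w = torch.randn(256, 256)                                 # dense MLP weight
w_bs = block_sparse_mask(w, block=32, keep_ratio=0.05)    # roughly 95% block sparsity
y = x @ w_bs                                              # a BSpMM kernel would skip the zero tiles
print(f"fraction of weights kept: {(w_bs != 0).float().mean().item():.2%}")

Pruning at tile granularity, rather than individual weights, is what makes the resulting matrices amenable to efficient SpMM: the kept blocks stay dense and map well onto GPU tensor cores.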

Requirements

Hardware

We evaluate BLaST on a compute partition of the Alps supercomputer. Each node is equipped with four NVIDIA Grace Hopper GH200 superchips interconnected with HPE Cray Slingshot-11.

Software

All experiments were executed using PyTorch v2.7.0, CUDA v12.8, Megatron-Core v0.10.0, Apex v0.1, and Transformer Engine v2.0.0.
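The core versions can be confirmed inside a running environment with a short check, for example (a minimal sketch using only PyTorch's own reporting):

import torch
print("PyTorch:", torch.__version__)            # expected: 2.7.0
print("CUDA (build):", torch.version.cuda)      # expected: 12.8
print("GPU available:", torch.cuda.is_available())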

To create the environment:

docker build -t <myorg>/<myapp>:latest .
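Once the image is built, it can be started with GPU access, for example (a suggested invocation, not from the repository; it assumes the NVIDIA Container Toolkit is installed and reuses the placeholder image tag above):

docker run --gpus all -it --rm <myorg>/<myapp>:latest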
