
BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers (PDF)

Abstract

The energy consumption of large-scale ML models is dominated by data movement, shuffling billions of parameters across memory hierarchies and data centers. Sparsification offers a principled way to mitigate these costs by pruning redundant weights and activations, thereby reducing data movement. Effective sparsification, however, remains challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable method for sparsification, applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss (in the majority of cases <2.25%). We show a 2.2x inference speedup for Llama 3.2 on 16 GPUs, and up to a 4.45x reduction in inference memory footprint, resulting in a 2.9x reduction in GPU setup and operating costs.

This repository contains the code for the BSpMM kernel from the paper "BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers".
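To make the target layout concrete, the following minimal sketch (PyTorch, illustrative only; the function name, block size, and keep ratio are assumptions, not the repository's kernel or API) prunes whole tiles of a dense MLP weight and then multiplies with the masked matrix. An actual BSpMM kernel stores and multiplies only the kept blocks instead of materializing the zeros.

# Minimal block-sparsity sketch (illustrative only, not the repository's kernel).
import torch

def block_sparse_mask(weight: torch.Tensor, block: int, keep_ratio: float) -> torch.Tensor:
    """Zero out whole (block x block) tiles of `weight`, keeping the tiles with the largest L1 norm."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of tiles and score each tile by its L1 norm.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    scores = tiles.abs().sum(dim=(1, 3))                  # shape: (rows//block, cols//block)
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = scores.flatten().topk(k).values.min()
    tile_mask = (scores >= threshold).to(weight.dtype)    # 1 = keep tile, 0 = prune tile
    # Broadcast the tile-level mask back to element granularity.
    mask = tile_mask[:, None, :, None].expand_as(tiles).reshape(rows, cols)
    return weight * mask

x = torch.randn(8, 256)                                   # activations
w = torch.randn(256, 256)                                 # dense MLP weight
w_bs = block_sparse_mask(w, block=32, keep_ratio=0.05)    # roughly 95% block sparsity
y = x @ w_bs                                              # a BSpMM kernel would skip the zero tiles
print(f"fraction of weights kept: {(w_bs != 0).float().mean().item():.2%}")

Pruning at tile granularity, rather than individual weights, is what makes the resulting matrices amenable to efficient SpMM: the kept blocks stay dense and map well onto GPU tensor cores.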

Requirements

Hardware

We evaluate BLaST on a compute partition of the Alps supercomputer. Each node is equipped with four NVIDIA Grace Hopper GH200 superchips interconnected with HPE Cray Slingshot-11.

Software

All experiments were executed using PyTorch v2.7.0, CUDA v12.8, Megatron-Core v0.10.0, Apex v0.1, and Transformer Engine v2.0.0.
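The core versions can be confirmed inside a running environment with a short check, for example (a minimal sketch using only PyTorch's own reporting):

import torch
print("PyTorch:", torch.__version__)            # expected: 2.7.0
print("CUDA (build):", torch.version.cuda)      # expected: 12.8
print("GPU available:", torch.cuda.is_available())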

To create the environment:

docker build -t <myorg>/<myapp>:latest .
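Once the image is built, it can be started with GPU access, for example (a suggested invocation, not from the repository; it assumes the NVIDIA Container Toolkit is installed and reuses the placeholder image tag above):

docker run --gpus all -it --rm <myorg>/<myapp>:latest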
