Implement fused modules #747
Merged
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request on Dec 15, 2023:
* MLP: Memory saving
* Remove RMSNorm restrictions
* Map packed weights to original
* FusedAttention module
* Simplify code
* Move fused modules
* Fix critical typo
* Split inplace
* Add FFT config
* Add validation of fused arguments
* Add fused arguments to config
* Update docs
* Fix validation logic
* Add fused modules to flash attn
* Only fuse during training
* Remove timing
* Formatting
* chore: lint
* add e2e tests for fused llama
* no lora for tests

Co-authored-by: Wing Lian <[email protected]>
djsaunde pushed a commit that referenced this pull request on Dec 17, 2024 (same commit message as above).
There are two common ways to fuse layers in Llama/Mistral-type models. Speed and memory are measured on an RTX 3090 with TinyLlama 1.1B.

- MLP: fuse `gate_proj` and `up_proj` together.
- Attention: fuse the `q_proj`, `k_proj`, and `v_proj` projections together.

All fusing of layers must happen AFTER the model is loaded, in order to load the pretrained weights into the fused modules.
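For illustration, here is a minimal sketch of the MLP case, assuming the standard Hugging Face Llama layer layout; `FusedLlamaMLP` and `fuse_mlp` are hypothetical names for this example, not the PR's actual classes:

```python
# Sketch: fuse gate_proj and up_proj into one nn.Linear AFTER the pretrained
# weights have been loaded, so the fused layer can be initialized from them.
import torch
import torch.nn as nn


class FusedLlamaMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # A single matmul produces both the gate and up activations.
        self.gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()
        self.intermediate_size = intermediate_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_up_proj(x).split(self.intermediate_size, dim=-1)
        return self.down_proj(self.act_fn(gate) * up)


def fuse_mlp(mlp: nn.Module) -> FusedLlamaMLP:
    """Map the packed weights of an already-loaded Llama MLP onto the fused module."""
    hidden_size = mlp.gate_proj.in_features
    intermediate_size = mlp.gate_proj.out_features
    fused = FusedLlamaMLP(hidden_size, intermediate_size)
    with torch.no_grad():
        # Stack the original weight matrices row-wise into the fused projection.
        fused.gate_up_proj.weight.copy_(
            torch.cat([mlp.gate_proj.weight, mlp.up_proj.weight], dim=0)
        )
        fused.down_proj.weight.copy_(mlp.down_proj.weight)
    return fused
```

Attention fusion follows the same pattern: concatenate the loaded `q_proj`, `k_proj`, and `v_proj` weights into a single linear, then split its output before computing attention.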
TinyLlama 1.1B - A6000
Conclusion: Fusing the MLP can save roughly 27% of cache memory. Fusing attention seems to do nothing for speed but increases memory by about 1GB.
Llama-2-7B - A100
Conclusion: Saves enough memory to load using `adamw_torch`.

None fused (main):
- `adamw_torch`: 37.732GB (+39.764GB cache, +1.366GB misc)
- `adamw_torch_fused`: 37.732GB (+14.506GB cache, +1.366GB misc)
- `adamw_bnb_8bit`: 25.393GB (+14.494GB cache, +1.366GB misc)

MLP fused (PR):
- `adamw_torch`: 37.732GB (+38.813GB cache, +1.366GB misc)
- `adamw_torch_fused`: 37.732GB (+14.647GB cache, +1.366GB misc)
- `adamw_bnb_8bit`: 25.269GB (+14.137GB cache, +1.366GB misc)

MLP + Attention fused (PR):
- `adamw_bnb_8bit`: 31.332GB (+13.752GB cache, +1.366GB misc)
- `adamw_torch`: OOM

QLoRA
Currently, it is not compatible with QLoRA, but there is potential to make it so. In bitsandbytes, you can import the 4-bit and 8-bit linear layers and use them instead of `nn.Linear`: https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/nn/modules.py#L258
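As a hedged sketch of that idea (not something this PR implements), a fused gate/up projection could be backed by the bitsandbytes quantized linears instead of `nn.Linear`; the shapes below assume Llama-2-7B dimensions:

```python
# Hypothetical example: quantized drop-in replacements for a fused projection.
import torch
import bitsandbytes as bnb

hidden_size, intermediate_size = 4096, 11008  # Llama-2-7B dimensions

# 4-bit (QLoRA-style) replacement for nn.Linear
gate_up_4bit = bnb.nn.Linear4bit(
    hidden_size, 2 * intermediate_size, bias=False, compute_dtype=torch.bfloat16
)

# 8-bit alternative
gate_up_8bit = bnb.nn.Linear8bitLt(
    hidden_size, 2 * intermediate_size, bias=False, has_fp16_weights=False
)

# In both cases the weights are quantized when the module is moved to the GPU,
# e.g. gate_up_4bit.cuda(), after the pretrained weights have been copied in.
```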