
Add Float8BlockwiseLinear with Triton kernels for quantization and GEMMs #2592

Open

danielvegamyhre wants to merge 1 commit into main from danielvegamyhre/stack/15

Conversation

@danielvegamyhre (Contributor) commented Jul 24, 2025

Stacked PRs:


Summary

  • Add "Float8BlockwiseLinear" and make it differentiable with autograd func to support training
  • Add Triton kernels for (1) activation quant, (2) weight quant, and (3) GEMM based on these DeepGemm inference kernel and rename GEMM to explicitly include expected scaling granularity of operands in the function names (fp8_gemm => blockwise_fp8_gemm_1x128_128x128).
  • Add new Triton kernel need for backward: blockwise_fp8_gemm_1x128_1x128_kernel for dW calculation where both left and right operands have activation scaling granularity (1 x block_size). This is a modified version of the kernel above, so it accepts 1x128 scaling for both operands.
  • GEMM kernels do accumulation in fp32 and cast output to bfloat16.
  • Modify all quantization kernels to use EPS to guard against NaNs upon division by 0.
  • Add tests verifying numerics by enforcing reasonable SQNR
  • Added benchmarking script for comparing Triton kernels vs FBGEMM vs DeepGEMM quantization kernels.
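
For orientation, here is a minimal sketch of how the autograd wiring fits the kernels listed above. The kernel names match the PR description, but the wrapper/import names, signatures, and operand layout conventions below are assumptions, not the actual implementation (which also has to handle nn.Linear's weight layout, casts, and shape bookkeeping):

import torch

# Hypothetical wrapper names for the kernels described above; the real module
# paths, signatures, and operand layout conventions in the PR may differ.
# For this sketch, assume each GEMM computes A @ B with row-major operands.
from torchao.prototype.blockwise_fp8 import (  # assumed import path
    fp8_blockwise_act_quant,            # returns (fp8 data, 1x128 scales)
    fp8_blockwise_weight_quant,         # returns (fp8 data, 128x128 scales)
    blockwise_fp8_gemm_1x128_128x128,   # fp32 accumulation, bf16 output
    blockwise_fp8_gemm_1x128_1x128,
)


class _Float8BlockwiseMM(torch.autograd.Function):
    """Sketch of the differentiable blockwise-fp8 matmul: out = x @ w."""

    @staticmethod
    def forward(ctx, x, w):  # x: (M, K), w: (K, N)
        ctx.save_for_backward(x, w)
        x_fp8, x_s = fp8_blockwise_act_quant(x)      # 1x128 scales
        w_fp8, w_s = fp8_blockwise_weight_quant(w)   # 128x128 scales
        return blockwise_fp8_gemm_1x128_128x128(x_fp8, x_s, w_fp8, w_s)

    @staticmethod
    def backward(ctx, grad_out):  # grad_out: (M, N)
        x, w = ctx.saved_tensors
        go_fp8, go_s = fp8_blockwise_act_quant(grad_out)

        # dX = grad_out @ w.T: activation-style (1x128) times weight-style (128x128) scales.
        wt_fp8, wt_s = fp8_blockwise_weight_quant(w.t().contiguous())
        grad_x = blockwise_fp8_gemm_1x128_128x128(go_fp8, go_s, wt_fp8, wt_s)

        # dW = x.T @ grad_out: both operands carry 1x128 scales, which is why the
        # new blockwise_fp8_gemm_1x128_1x128 kernel is needed.
        xt_fp8, xt_s = fp8_blockwise_act_quant(x.t().contiguous())
        grad_w = blockwise_fp8_gemm_1x128_1x128(xt_fp8, xt_s, go_fp8, go_s)
        return grad_x, grad_w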

Why use Triton kernels instead of DeepGEMM CUTLASS kernels?

The GEMM APIs used in @vkuzo's PoC here no longer exist in DeepGemm. I tried using the new GEMM APIs (fp8_gemm_nt, etc.) and:

  • On B200, with both Vasiliy's PR and my PR, I got device-side asserts on this line that were not immediately clear how to resolve.
  • On H100, I only tried Vasiliy's PR, but got an undefined symbols error from CUDA, despite using CUDA toolkit 12.8+ as stated in the readme.

Since our only goal is a functional skeleton rather than performance, instead of spending more time on this I just used the existing Triton kernels we had and made a modified GEMM (a 1-line change) to support blockwise_fp8_gemm_1x128_1x128_kernel.

If we want to replace these Triton GEMMs with the CUTLASS ones later to see if perf is better (it probably is), we can do that.

Note on numerics

Interestingly, the reference DeepGemm Triton quantization kernels do NOT use EPS/clamping to prevent division by 0. This resulted in my unit tests passing (where inputs were drawn from a normal distribution), but NaNs occurring in TorchTitan training runs, where actual activation values sometimes had an amax of 0.

I updated the kernels to use the same EPS guards as torchao.float8, and this fixed the NaNs.
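
As a minimal illustration of the fix (the exact EPS value, fp8 dtype, and scale convention used by the kernels are assumptions here, and the real computation happens inside Triton):

import torch

# Assumed constants; torchao.float8 uses a small EPS like this to avoid div-by-zero.
EPS = 1e-12
FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def blockwise_scales(x: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """Per-block scales for a 2D tensor, guarded against amax == 0.

    Sketch only; assumes K is a multiple of block_size, and the scale
    convention (amax / fp8_max vs. fp8_max / amax) may differ from the PR.
    """
    m, k = x.shape
    blocks = x.abs().reshape(m, k // block_size, block_size)
    amax = blocks.amax(dim=-1).to(torch.float32)
    # The key fix: clamp amax to EPS so all-zero blocks yield a finite scale.
    return torch.clamp(amax, min=EPS) / FP8_E4M3_MAX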

Test plan

  • pytest test/prototype/blockwise_fp8/test_blockwise_kernels.py
  • pytest test/prototype/blockwise_fp8/test_blockwise_linear.py

Torchtitan PoC integration results

  • Logs show the converted model and a nicely decreasing loss, but very low throughput (~5.1k vs ~9.8k for bfloat16).
  • I think perf may be bad because of (1) all the eager .t().contiguous()-style transformations, and (2) the DeepGemm Triton GEMM requiring the B tensor to be row-major, then doing strided reads to get it into column-major in SMEM for the fp8 wgmma (same issue as with fp8 flex attention, I think).


pytorch-bot bot commented Jul 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2592


❌ 4 New Failures

As of commit d0631c0 with merge base 0e00df3:

@danielvegamyhre (Contributor Author)

cc @vkuzo @drisspg for review

error = torch.norm(C - C_q) / torch.norm(C)
print(f"Relative Error: {error.item():.6f}")

assert error < 0.1, "Quantize gemm error is too high"

Contributor

Can you use SQNR everywhere to match w/ existing numerics testing?

Contributor Author

Updated to use SQNR
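
For context, the SQNR check used in existing torchao numerics tests looks roughly like the sketch below (signal power over quantization-noise power, in dB); the helper name and the example threshold are assumptions:

import torch

def sqnr_db(ref: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in decibels.
    signal = torch.norm(ref)
    noise = torch.norm(ref - actual)
    return 20 * torch.log10(signal / noise)

# Example threshold in the spirit of existing numerics tests; the exact bound
# used in this PR's tests is an assumption.
# assert sqnr_db(C, C_q) > 25.0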


# original implementation from fbgemm_gpu:
# https://github.com/pytorch/FBGEMM/blob/b19401e913fcdff536dc097fa3013a0a9d66256e/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py#L3091
def triton_quantize_fp8_block(

@drisspg (Contributor) Jul 24, 2025

Since we have an optional runtime dependency on fbgemm, can we just call their kernel directly?

@danielvegamyhre (Contributor Author) Jul 25, 2025

Yes, that is the desired end state. For now I have had repeated problems getting it (fbgemm-gpu-genai) to work, e.g. undefined symbols. I tried on both H100 and B200 and got different undefined symbol errors.
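
For reference, the direct call would presumably look something like the sketch below once fbgemm-gpu-genai installs cleanly; the module path comes from the fbgemm_gpu link above, but the argument names and defaults are assumptions about its signature:

import torch

# Module path taken from the fbgemm_gpu source linked above; the exact
# argument names/defaults are assumptions, not verified against the API.
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_block,
)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
# Hypothetical call: quantize in 128x128 blocks, returning fp8 data + scales.
x_fp8, x_scales = triton_quantize_fp8_block(x, block_m=128, block_k=128)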

@drisspg (Contributor) commented Jul 24, 2025

(32768, 32768)  (1,128)          5732.42      7960.21        7830.56
(32768, 32768)  (128,128)       13692.4        669.664       7831.14

These numbers look odd to me; do you have memory bandwidth calcs? I don't immediately get why there is a 10x delta in groupwise vs. blockwise.

@danielvegamyhre (Contributor Author)

> These numbers look odd to me; do you have memory bandwidth calcs? I don't immediately get why there is a 10x delta in groupwise vs. blockwise.

Yeah, I agree it's odd. I will try adding some mem bw calcs, and was also thinking about checking with Josh / the fbgemm team whether perhaps there is a different kernel they use for activation quant.
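
A rough bandwidth calc could look like the sketch below, assuming the kernel reads a bf16 input and writes an fp8 output plus one fp32 scale per block, and that the times in the table above are in microseconds (both assumptions, not taken from the benchmark script):

import torch

def quant_kernel_gbps(shape, block_shape, time_us):
    """Rough achieved memory bandwidth (GB/s) for a blockwise quantization kernel.

    Assumes: read bf16 input (2 B/elem), write fp8 output (1 B/elem) and one
    fp32 scale (4 B) per block. Compare against peak HBM bandwidth
    (e.g. ~3.35 TB/s on H100 SXM) to see how memory-bound the kernel is.
    """
    m, k = shape
    block_m, block_k = block_shape
    n_elems = m * k
    n_blocks = (m // block_m) * (k // block_k)
    bytes_moved = n_elems * 2 + n_elems * 1 + n_blocks * 4
    return bytes_moved / (time_us * 1e-6) / 1e9

# Example using the (32768, 32768) row with (1, 128) blocks from the table above.
print(quant_kernel_gbps((32768, 32768), (1, 128), 5732.42))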

@danielvegamyhre (Contributor Author) commented Jul 26, 2025

@drisspg @vkuzo this is ready for another look. Numerics look good and the torchtitan loss curve looks good; perf is bad for Llama3 8B right now.

(I accidentally squashed my stack into a single PR with stack-pr; sorry for the large PR, I can break it up if necessary.)
