[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations #10867
base: main
Conversation
Force-pushed from 27be0bd to e2fda7f
Focused on csrc/quantization/activation_kernels.cu. Spotted a couple of potential int32_t overflows.
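For context, the failure mode being flagged usually looks like the sketch below: a per-token offset computed in 32-bit arithmetic wraps once num_tokens * hidden_size exceeds 2^31. This is a hypothetical illustration, not the PR's kernel; the fix is to promote the index math to 64 bits before multiplying.

#include <cstdint>

// Hypothetical sketch of the overflow pattern (not the code in this PR).
__global__ void example_kernel(float* out, const float* in, int hidden_size) {
  // Risky: blockIdx.x (unsigned int) * hidden_size (int) is evaluated in
  // 32-bit arithmetic and can wrap for large tensors:
  //   int offset = blockIdx.x * hidden_size + threadIdx.x;

  // Safer: promote the token index to 64 bits before the multiply.
  const int64_t token_idx = blockIdx.x;
  const int64_t offset = token_idx * hidden_size + threadIdx.x;
  if (threadIdx.x < hidden_size) {
    out[offset] = in[offset];
  }
}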
This pull request has merge conflicts that must be resolved before it can be merged.
A couple more comments. LGTM if we can support non-power-of-two hidden sizes.
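For illustration only (names are made up, this is not the PR's kernel): supporting hidden sizes that are not powers of two mostly comes down to using a strided, bounds-checked loop over the hidden dimension instead of assuming it divides the block size evenly.

#include <cstdint>

// Hypothetical sketch: one block per token, a strided loop over the hidden
// dimension. Nothing here requires d to be a power of two or a multiple of
// blockDim.x; each thread simply checks its own bound.
__global__ void per_token_loop(float* out, const float* in, int d) {
  const int64_t base = static_cast<int64_t>(blockIdx.x) * d;
  for (int i = threadIdx.x; i < d; i += blockDim.x) {
    out[base + i] = in[base + i];
  }
}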
Force-pushed from 21c0f3d to 70bb71b
Force-pushed from 70bb71b to 8514b0e
Apologies for the noise. I accidentally added my signature to a bunch of irrelevant commits, which pulled them into the PR temporarily. Things should be sorted now.
} // namespace vllm

// Launch activation, gating, and quantize kernel.
#define LAUNCH_ACTIVATION_GATE_KERNEL(KERNEL) \
Is there a reason this needs a macro?
I just copied what the existing act_and_mul kernel does. This allows us to just drop in kernels for the other activation functions. I'm in favor of keeping it.
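For readers unfamiliar with that pattern: the launch macro is parameterized over the activation device function, so supporting another activation only requires a new device functor while the launch boilerplate stays in one place. A simplified, approximate sketch follows; the kernel name, arguments, and the quantization plumbing are placeholders rather than the PR's actual code.

#include <algorithm>
#include <cstdint>

namespace vllm {

// Placeholder activation functor; the real code templates over scalar_t.
__device__ __forceinline__ float silu(float x) {
  return x / (1.0f + expf(-x));
}

// Placeholder fused kernel: silu_and_mul with the quantization step omitted.
template <float (*KERNEL)(float)>
__global__ void act_and_mul_quant_kernel(float* __restrict__ out,
                                         const float* __restrict__ input,
                                         int d) {
  const int64_t token_idx = blockIdx.x;
  for (int i = threadIdx.x; i < d; i += blockDim.x) {
    const float x = input[token_idx * 2 * d + i];      // gate half
    const float y = input[token_idx * 2 * d + d + i];  // up half
    out[token_idx * d + i] = KERNEL(x) * y;            // quantize step omitted
  }
}

}  // namespace vllm

// The macro relies on num_tokens, d, out_ptr, and input_ptr being in scope,
// mirroring how the existing act_and_mul launch macros are written.
#define LAUNCH_ACTIVATION_GATE_KERNEL(KERNEL)        \
  do {                                               \
    dim3 grid(num_tokens);                           \
    dim3 block(std::min(d, 1024));                   \
    vllm::act_and_mul_quant_kernel<KERNEL>           \
        <<<grid, block>>>(out_ptr, input_ptr, d);    \
  } while (0)

// Swapping in a different activation is then a one-line change:
void silu_and_mul_quant_example(float* out_ptr, const float* input_ptr,
                                int num_tokens, int d) {
  LAUNCH_ACTIVATION_GATE_KERNEL(vllm::silu);
}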
Glad to see more fusion passes. I will hand it over to @tlrmchlsmth and @ProExpertProg for detailed review.
Credit to @LucasWilkinson for the kernel.
This pass currently only supports static per-tensor quantization. Other quantization schemes will be included in subsequent PRs.
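As background on the static per-tensor scheme: the scale is a single tensor-wide constant known ahead of time, so the elementwise epilogue the fused kernel needs is very small. A rough sketch of that elementwise step, using illustrative names and the CUDA fp8 type rather than anything taken from the PR:

#include <cuda_fp8.h>  // requires CUDA 11.8+ for __nv_fp8_e4m3

// Rough sketch of static per-tensor FP8 (e4m3) quantization applied to one
// element of the silu_and_mul output. inv_scale = 1 / scale is a single
// tensor-wide constant, which is what makes the scheme "static per-tensor".
__device__ __forceinline__ __nv_fp8_e4m3 scaled_fp8_quant_elem(
    float x, float inv_scale) {
  // Clamp to the finite e4m3 range (+/-448) before converting down.
  const float scaled = fminf(fmaxf(x * inv_scale, -448.0f), 448.0f);
  return static_cast<__nv_fp8_e4m3>(scaled);
}

Per-token or dynamic schemes would additionally need to compute the scale from the activations inside the kernel, which is presumably part of why they are left for follow-up PRs.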
I've attached some QPS sweeps that were run using neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 on an H100. Generally speaking, this pass improves the TPOT of FP8 Llama by 2-3%. There are similar improvements in TTFT, with the exception of 20 QPS, which is much (~2x) faster.

[Attachment: fused_results]
[Attachment: torch_compile_results]