[torch.compile] Add torch inductor pass for fusing silu_and_mul with subsequent scaled_fp8_quant operations #10867
Signed-off-by: Sage Moore <[email protected]>
27be0bd to e2fda7f
Focused on csrc/quantization/activation_kernels.cu. Spotted a couple of potential int32_t overflows.
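To illustrate the kind of overflow this comment refers to, here is a minimal sketch in Python. It is not taken from the PR. The helper names and the shape values are hypothetical, and `wrap_int32` only models the two's-complement wraparound that usually results in practice (signed overflow is formally undefined behaviour in C++). The point is that a flat offset such as `token_idx * hidden_size` can exceed the int32 range for large inputs, so the index arithmetic should be done in 64 bits.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def flat_offset_int32(token_idx: int, hidden_size: int) -> int:
    """Offset as it would come out if the product were computed in 32 bits."""
    return wrap_int32(token_idx * hidden_size)


def flat_offset_int64(token_idx: int, hidden_size: int) -> int:
    """Offset computed with enough width to hold the product."""
    return token_idx * hidden_size


# Hypothetical shape, chosen only so that the product exceeds INT32_MAX.
num_tokens = 200_000
hidden_size = 14_336
last_token = num_tokens - 1

wide = flat_offset_int64(last_token, hidden_size)
narrow = flat_offset_int32(last_token, hidden_size)

print("64-bit offset:", wide)
print("32-bit offset:", narrow)
print("overflowed:", wide > INT32_MAX)
```

In a kernel, the usual remedy is to widen one operand (for example to `int64_t`) before the multiplication rather than after it.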
A couple more comments - LGTM if we can support non-power-of-two hidden sizes
21c0f3d to 70bb71b
…silu-mul-quant
70bb71b to 8514b0e
Apologies for the noise. I accidentally added my signature to a bunch of irrelevant commits, which pulled them into the PR temporarily. Things should be sorted now.
} // namespace vllm

// Launch activation, gating, and quantize kernel.
#define LAUNCH_ACTIVATION_GATE_KERNEL(KERNEL) \
Is there a reason this needs a macro?
I just copied what the existing act_and_mul kernel does. The macro lets us drop in kernels for the other activation functions, so I'm in favor of keeping it.
Glad to see more fusion passes. I will hand it over to @tlrmchlsmth and @ProExpertProg for detailed review.
Credit to @LucasWilkinson for the kernel.
This pass currently supports only static per-tensor quantization. Other quantization schemes will be included in subsequent PRs.
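As a rough reference for what the fusion computes, here is a minimal pure-Python sketch of the arithmetic. It is not the PR's kernel. It assumes that silu_and_mul applies SiLU to the first half of each row and multiplies by the second half, and that static per-tensor quantization divides by a single precomputed scale and clamps to the e4m3 range (448). The function names are illustrative, and the final rounding to the FP8 grid is omitted.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def clamp(v: float) -> float:
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v))


def unfused(row: list[float], scale: float) -> list[float]:
    """Two passes: materialize silu(gate) * up, then scale and clamp."""
    d = len(row) // 2
    intermediate = [silu(row[i]) * row[d + i] for i in range(d)]
    inv_scale = 1.0 / scale
    return [clamp(v * inv_scale) for v in intermediate]


def fused(row: list[float], scale: float) -> list[float]:
    """One pass: no intermediate activation buffer is written out."""
    d = len(row) // 2
    inv_scale = 1.0 / scale
    return [clamp(silu(row[i]) * row[d + i] * inv_scale) for i in range(d)]


# Illustrative input: first half is the gate, second half is the up projection.
row = [1.0, -2.0, 0.5, 3.0, 4.0, 1000.0]
scale = 0.5  # static scale, fixed ahead of time (hypothetical value)

print(fused(row, scale))
print(fused(row, scale) == unfused(row, scale))
```

The two functions produce the same values. The difference the pass targets is that the fused form avoids writing the intermediate activation and reading it back for quantization.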
I've attached some QPS sweeps that were run using neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 on an H100. Generally speaking, this pass improves the TPOT of FP8 Llama by 2-3%. TTFT improves by a similar amount, except at 20 QPS, where it is much (~2x) faster.
[Attached images: fused_results, torch_compile_results]