Condition to achieve linear speedup? #15
Comments
The overhead of activation quantization using simple PyTorch operations is substantial, but even the kernel itself is slower than nn.Linear in most cases.
@jiwonsong-dev There is online activation quantization using simple PyTorch in
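For reference, a minimal sketch of what online activation quantization in plain PyTorch typically means (symmetric per-token int8 here; this is an assumption, not the repo's exact code) and where the extra kernel launches come from:

```python
import torch

def quantize_per_token_int8(x: torch.Tensor):
    """Symmetric per-token int8 quantization of a (tokens, features) activation.

    Illustrative sketch only; a fused kernel avoids these separate PyTorch ops
    (abs/max, div, round, cast), which are exactly the quantization overhead
    discussed above.
    """
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale  # the scale is needed later to dequantize the GEMM output
```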
Is the kernel integrated into vLLM the same one as in this repo?
@jiwonsong-dev The kernel is the same as the one in vLLM. If there are no other operations like dtype conversion and reshape in your modified QuantLinear, it should deliver similar performance to calling the GEMM kernel directly. In our repo, QuantLinear is only used for simple inference; I recommend trying vLLM for practical inference.
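One way to check whether the wrapper (dtype conversion, reshape) or the GEMM itself dominates is to time the forward pass with CUDA events. A minimal sketch, assuming a CUDA device; `QuantLinear` in the comment stands for the repo's module:

```python
import torch

def time_forward(layer, x, iters=100):
    """Median latency of layer(x) in milliseconds, measured with CUDA events."""
    for _ in range(10):                      # warmup
        layer(x)
    torch.cuda.synchronize()
    times = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        layer(x)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))
    return sorted(times)[len(times) // 2]

# Example usage (shapes are placeholders):
# fp16 = torch.nn.Linear(4096, 4096, bias=False).half().cuda()
# x = torch.randn(1024, 4096, dtype=torch.float16, device="cuda")
# print("fp16 nn.Linear:", time_forward(fp16, x), "ms")
# print("QuantLinear   :", time_forward(quant_linear, x), "ms")  # repo's module
```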
I checked your fork of the Marlin repository and saw an actual speedup with the benchmark code. Thank you for the kind response!
Is there any specific reason why the permutation is different when packing per-channel quantized weights? The per-group path follows the original Marlin format.
@jiwonsong-dev It is related to the mma instruction's fragment layout requirements: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-integer-type
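To illustrate the idea only (this is not the repo's actual permutation): packing rearranges each group of eight 4-bit weights so that, after load and dequant, every thread of the warp holds exactly the fragment elements that `mma.m16n8k16` expects. A generic interleaved-packing sketch with an arbitrary example order:

```python
import torch

def pack_int4(w_q: torch.Tensor, order=(0, 2, 4, 6, 1, 3, 5, 7)) -> torch.Tensor:
    """Pack groups of eight 4-bit values (integers in [0, 15]) into one word.

    `order` is an arbitrary example interleave; the real permutation is dictated
    by the mma fragment layout and differs between packing schemes. Packed into
    int64 here only to keep the sketch free of signed-overflow concerns; real
    kernels pack into 32-bit words.
    """
    assert w_q.shape[-1] % 8 == 0
    w = w_q.reshape(*w_q.shape[:-1], -1, 8).to(torch.int64)
    packed = torch.zeros(w.shape[:-1], dtype=torch.int64, device=w_q.device)
    for dst, src in enumerate(order):
        packed |= (w[..., src] & 0xF) << (4 * dst)
    return packed
```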
@HandH1998 What puzzles me is that I am already using the w4a8 per-channel version, with no groups, so why is the w4a8 first token still so slow? Have you analyzed details like this for your w4a8 no-group kernel? Any further advice on optimizing the kernel?
@brisker It is normal that w4a8 first-token latency is slower than w8a8, since the additional dequant operation of w4a8 (running on the slower CUDA cores) slows down the main loop, even though the dequant overhead is small. In my experiments, if your case includes a couple of decoding iterations, the final w4a8 speed is always faster than w8a8 because of the better decoding speed. Here we provide the detailed results for input length = 1024: TTFT (ms) and TPOT (ms).
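To make the trade-off concrete: end-to-end latency is TTFT plus TPOT times the remaining decode steps, so a small prefill penalty is repaid quickly when decoding is faster. A sketch with made-up numbers (not the measurements from this issue):

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, new_tokens: int) -> float:
    """Total generation latency: prefill (TTFT) plus (new_tokens - 1) decode steps."""
    return ttft_ms + tpot_ms * (new_tokens - 1)

# Hypothetical numbers for illustration only:
# w8a8: TTFT 100 ms, TPOT 20 ms;  w4a8: TTFT 120 ms, TPOT 15 ms
for n in (4, 16, 64):
    w8a8 = end_to_end_latency_ms(100, 20, n)
    w4a8 = end_to_end_latency_ms(120, 15, n)
    print(f"{n:3d} new tokens: w8a8 {w8a8:.0f} ms, w4a8 {w4a8:.0f} ms")
# With these numbers, w4a8 overtakes w8a8 after about 5 decode steps.
```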
@HandH1998 |
TTFT: Time To First Token
@HandH1998 |
The cuBLAS w8a8 GEMM is from vllm-project/vllm#1508, but cuBLAS and CUTLASS should have similar performance.
Does TPOT already include the first decoding time, or have you excluded the first-token time?
It doesn't include the first token.
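So, assuming the usual definitions (my reading, not a quote from the maintainers), the two metrics would be computed like this:

```python
def ttft_ms(request_start_s: float, first_token_s: float) -> float:
    """Time To First Token: from request arrival until the first generated token."""
    return (first_token_s - request_start_s) * 1000

def tpot_ms(first_token_s: float, last_token_s: float, num_new_tokens: int) -> float:
    """Time Per Output Token, averaged over decode only (first token excluded)."""
    return (last_token_s - first_token_s) * 1000 / (num_new_tokens - 1)
```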
@HandH1998 Considering this sheet, I am just confused about why w4a8 is faster than w8a8 on the 70B model. It seems that this cannot be explained by the theoretical roofline model, i.e., the figure in QServe...
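A rough back-of-the-envelope view (my own numbers, not the authors'): at small batch sizes decoding is memory-bound, so per-token GEMM time scales with the bytes of weights read, which favors 4-bit over 8-bit weights irrespective of the compute roofline:

```python
def weight_bytes_per_token(num_params: float, bits_per_weight: int) -> float:
    """Bytes of weight traffic per decoded token, ignoring KV cache and activations."""
    return num_params * bits_per_weight / 8

# Rough illustration for a 70B-parameter model:
for bits in (16, 8, 4):
    gb = weight_bytes_per_token(70e9, bits) / 1e9
    print(f"w{bits}: ~{gb:.0f} GB of weights read per decode step")
# On roughly 2 TB/s of HBM bandwidth that is about 70, 35, and 17.5 ms per token,
# which is why w4a8 decode can beat w8a8 even though its prefill is compute-bound.
```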
I tested the latency of the QuantLinear forward pass with various input and feature sizes.
But for token counts from 1 to 1024, I could not see any speedup compared to the AWQ W4A16 kernel, and the results were worse than PyTorch FP16 Linear in most cases.
I tested weight sizes (4096, 4096), (5120, 5120), (6656, 6656), and (8192, 8192), which match the linear layer sizes of the LLaMA model family, on A6000 and RTX 3090 GPUs.
I see the experiments in the paper were run on an A100 GPU.
Is there any specific setting or condition needed to see a speedup that aligns with the results in the paper?
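For reference, a simplified sketch of the kind of sweep described above (assumes a CUDA device; `QuantLinear` stands for the repo's layer and its constructor arguments are omitted):

```python
import time
import torch

sizes = [(4096, 4096), (5120, 5120), (6656, 6656), (8192, 8192)]  # LLaMA-family linears
token_counts = [1, 16, 64, 256, 1024]

def bench_ms(layer, x, iters=50):
    """Average wall-clock latency of layer(x) in milliseconds, after warmup."""
    for _ in range(10):
        layer(x)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        layer(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) * 1000 / iters

for in_f, out_f in sizes:
    fp16 = torch.nn.Linear(in_f, out_f, bias=False).half().cuda()
    for m in token_counts:
        x = torch.randn(m, in_f, dtype=torch.float16, device="cuda")
        print(f"({in_f}, {out_f}) m={m}: fp16 {bench_ms(fp16, x):.3f} ms")
        # quant = QuantLinear(...)  # the repo's layer, benchmarked the same way
```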