[Feature Request] 3-bit support #117

Open
mobicham opened this issue Aug 1, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments

mobicham commented Aug 1, 2024

Great work!
Any chance you could add support for 3-bit? I know the bitpacking is a bit tricky with 3-bit, but it would be great to have a 3-bit kernel for linear quantization, since the only one available is LUT-based via flute, and 2-bit quantization quality for smaller pre-trained models is sub-optimal for production.
Thanks!
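For context on why the bitpacking is tricky: 3 does not divide 8 or 32 evenly, so 3-bit codes straddle byte boundaries. Below is a minimal, illustrative NumPy sketch of one common workaround (packing ten 3-bit codes per 32-bit word and wasting 2 bits); it is not how BitBLAS or flute lay out weights.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (values 0..7) into uint32 words, ten codes per word (2 bits wasted)."""
    assert codes.min() >= 0 and codes.max() < 8
    pad = (-len(codes)) % 10                       # pad so the length is a multiple of 10
    padded = np.concatenate([codes, np.zeros(pad, dtype=codes.dtype)]).reshape(-1, 10)
    words = np.zeros(len(padded), dtype=np.uint32)
    for i in range(10):
        words |= padded[:, i].astype(np.uint32) << np.uint32(3 * i)
    return words

def unpack_3bit(words: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_3bit; n is the original number of codes."""
    out = np.empty((len(words), 10), dtype=np.uint8)
    for i in range(10):
        out[:, i] = (words >> np.uint32(3 * i)) & np.uint32(0x7)
    return out.reshape(-1)[:n]

codes = np.random.randint(0, 8, size=123).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(codes), len(codes)), codes)
```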

mobicham changed the title from "3-bit support" to "[Feature Request] 3-bit support" on Aug 1, 2024
@LeiWang1999 (Contributor)

Hi @mobicham, we will consider supporting it in our upcoming release. Is the flute implementation not optimal (e.g., approximately a 5x speedup over fp16 gemv)?


mobicham commented Aug 1, 2024

In my benchmarks on a 3090, it's not that fast end-to-end: Llama3 8B decoding speed at 4-bit is about 67 tokens/sec with flute vs. 97 tokens/sec with torchao/bitblas (group-size=64, batch-size=1).
The quality tends to be better with LUT than with linear quantization though, as expected, since linear quantization is just a special case of LUT. Linear quantization should run faster since there's no cost to read the LUT from shared memory.
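To make the "special case" point concrete: a linear (affine) dequantizer w = scale * (code - zero) can always be rewritten as a lookup table indexed by the code, so a LUT kernel subsumes it; the extra cost of the LUT kernel is just fetching that table. A toy PyTorch illustration (numbers are arbitrary):

```python
import torch

bits, scale, zero = 3, 0.05, 4                 # toy linear-quantization parameters
codes = torch.randint(0, 2**bits, (8,))        # quantized weight codes

# Linear dequantization: w = scale * (code - zero)
w_linear = scale * (codes - zero).float()

# The same dequantization expressed as a LUT over all 2**bits possible codes
lut = scale * (torch.arange(2**bits) - zero).float()
w_lut = lut[codes]

assert torch.equal(w_linear, w_lut)
```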

@LeiWang1999 (Contributor)

@mobicham, got it! Thanks for sharing.


brisker commented Aug 1, 2024

@LeiWang1999
Are there any benchmark speed tests for w4a8 compared to fp16?


LeiWang1999 commented Aug 1, 2024

@brisker, we provide a benchmark script for the BitBLAS matmul:

https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py
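For anyone who just wants a single decode-shaped data point rather than the full sweep, here is a rough sketch of timing one BitBLAS W4A16 GEMV with CUDA events. The MatmulConfig / Matmul / transform_weight usage is adapted from the BitBLAS README quickstart; exact parameter names and dtype strings may differ between releases, and the shapes are illustrative.

```python
import torch
import bitblas

# Decode-shaped GEMV: batch size 1, Llama-like hidden size (illustrative numbers).
cfg = bitblas.MatmulConfig(
    M=1, N=4096, K=4096,
    A_dtype="float16", W_dtype="int4",
    accum_dtype="float16", out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=cfg)

A = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
W = torch.randint(0, 7, (4096, 4096), dtype=torch.int8, device="cuda")
Wq = matmul.transform_weight(W)             # repack int8 codes into the 4-bit layout

for _ in range(10):                         # warmup
    matmul(A, Wq)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    matmul(A, Wq)
end.record()
torch.cuda.synchronize()
print(f"avg W4A16 GEMV latency: {start.elapsed_time(end) / 100:.3f} ms")
```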


brisker commented Aug 5, 2024

@LeiWang1999

1. In the link you provided, I noticed that you compared bitblas-w4a16 with marlin-w4a16. I want to ask: are they both tested with per-channel w4 quantization (i.e., with no group-wise weight-quantization tricks)?

2. Is the w4a8 quantization pipeline integrated into vLLM yet?

LeiWang1999 added the "enhancement" (New feature or request) label on Aug 14, 2024