[Feature Request] 3-bit support #117
Great work!
Any chance you could add support for 3-bit? I know the bitpacking is a bit tricky with 3-bit (a quick sketch of why is below), but it would be great to have a 3-bit kernel for linear quantization: the only one available is the LUT-based kernel in flute, and 2-bit quantization quality for smaller pre-trained models is sub-optimal for production.
Thanks!
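The trickiness is that 3 does not divide 8 or 32: eight 3-bit values occupy exactly three bytes, so packed values straddle byte boundaries (a 32-bit word holds ten values with two bits wasted). A minimal NumPy sketch of byte-aligned 3-bit pack/unpack; the helper names are hypothetical, not BitBLAS code:

```python
import numpy as np

def pack_3bit(vals: np.ndarray) -> np.ndarray:
    """Pack 3-bit values (0..7) into bytes: every 8 values -> 3 bytes."""
    assert vals.size % 8 == 0
    groups = vals.astype(np.uint32).reshape(-1, 8)
    # Accumulate 8 x 3 = 24 bits per group into one 32-bit word.
    word = np.zeros(len(groups), dtype=np.uint32)
    for i in range(8):
        word |= groups[:, i] << (3 * i)
    # Spill the 24 useful bits into 3 bytes; values straddle byte boundaries.
    out = np.empty((len(groups), 3), dtype=np.uint8)
    for b in range(3):
        out[:, b] = (word >> (8 * b)) & 0xFF
    return out.reshape(-1)

def unpack_3bit(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_3bit: every 3 bytes -> 8 values in 0..7."""
    triples = packed.reshape(-1, 3).astype(np.uint32)
    word = triples[:, 0] | (triples[:, 1] << 8) | (triples[:, 2] << 16)
    out = np.empty((len(word), 8), dtype=np.uint8)
    for i in range(8):
        out[:, i] = (word >> (3 * i)) & 0x7
    return out.reshape(-1)

# Round-trip check on random 3-bit data.
v = np.random.randint(0, 8, size=64).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(v)), v)
```

A fast GPU kernel has to do the same cross-boundary unpacking in registers, which is part of why 3-bit lags behind the power-of-two widths.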
Comments

Hi @mobicham, we will consider supporting it in our upcoming release. Is the flute implementation not optimal (for example, roughly a 5x speedup over fp16 GEMV)?
In my benchmarks on the 3090, it's not that fast end-to-end: Llama3 8B decoding speed for 4-bit is about 67 tokens/sec with flute vs. 97 tokens/sec with torchao/bitblas (group-size=64, batch-size=1).
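For context on how figures like these are typically produced: generated tokens divided by wall-clock time for a batch-size-1 greedy decode. A rough sketch with Hugging Face transformers; the model id, prompt, and lengths are placeholders, not the exact harness behind the numbers above:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; swap in a quantized checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")

# Warmup run so kernel compilation/autotuning doesn't pollute the timing.
model.generate(**inputs, max_new_tokens=16, do_sample=False)

torch.cuda.synchronize()
t0 = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

n_new = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{n_new / elapsed:.1f} tokens/sec")
```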
@mobicham, got it! Thanks for sharing.
@LeiWang1999
@brisker, we provide benchmark scripts for the bitblas matmul here: https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py
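For a quick standalone experiment outside that script, the high-level bitblas.Matmul API from the README looks roughly like this; treat it as a sketch, since the exact MatmulConfig fields can change between releases (and a 3-bit W_dtype is precisely what this issue asks for):

```python
import torch
import bitblas

# Batch-size-1 GEMV with fp16 activations and 4-bit weights, per the README-style API.
config = bitblas.MatmulConfig(
    M=1,                   # single decode token
    N=4096,                # output features
    K=4096,                # input features
    A_dtype="float16",
    W_dtype="int4",        # a 3-bit dtype here is what this issue requests
    accum_dtype="float16",
    out_dtype="float16",
    layout="nt",
    with_bias=False,
)
matmul = bitblas.Matmul(config=config)

# Random weights in signed int4 range; transform_weight packs them for the kernel.
w = torch.randint(-8, 8, (4096, 4096), dtype=torch.int8).cuda()
x = torch.rand(1, 4096, dtype=torch.float16).cuda()
y = matmul(x, matmul.transform_weight(w))
```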