[Feature Request] 3-bit support #117

Open
mobicham opened this issue Aug 1, 2024 · 6 comments
Labels: enhancement (New feature or request)

Comments

mobicham commented Aug 1, 2024

Great work!
Any chance you could add support for 3-bit? I know the bitpacking is a bit tricky with 3-bit, but it would be great to have a 3-bit kernel for linear quantization, since the only one available is LUT-based via flute, and 2-bit quantization quality for smaller pre-trained models is sub-optimal for production.
Thanks!
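For context on why the bitpacking is tricky: 3 does not divide 8 or 32 evenly, so 3-bit codes straddle byte boundaries. Below is a minimal, illustrative NumPy sketch of one common workaround (packing ten 3-bit codes per 32-bit word and wasting 2 bits); it is not how BitBLAS or flute lay out weights.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> np.ndarray:
    """Pack 3-bit codes (values 0..7) into uint32 words, ten codes per word (2 bits wasted)."""
    assert codes.min() >= 0 and codes.max() < 8
    pad = (-len(codes)) % 10                       # pad so the length is a multiple of 10
    padded = np.concatenate([codes, np.zeros(pad, dtype=codes.dtype)]).reshape(-1, 10)
    words = np.zeros(len(padded), dtype=np.uint32)
    for i in range(10):
        words |= padded[:, i].astype(np.uint32) << np.uint32(3 * i)
    return words

def unpack_3bit(words: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_3bit; n is the original number of codes."""
    out = np.empty((len(words), 10), dtype=np.uint8)
    for i in range(10):
        out[:, i] = (words >> np.uint32(3 * i)) & np.uint32(0x7)
    return out.reshape(-1)[:n]

codes = np.random.randint(0, 8, size=123).astype(np.uint8)
assert np.array_equal(unpack_3bit(pack_3bit(codes), len(codes)), codes)
```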

mobicham changed the title from "3-bit support" to "[Feature Request] 3-bit support" on Aug 1, 2024
@LeiWang1999 (Contributor)

Hi @mobicham, we will consider supporting it in our upcoming release. Is the flute implementation not optimal (e.g., approximately a 5x speedup over fp16 gemv)?


mobicham commented Aug 1, 2024

In my benchmarks on a 3090, it's not that fast end-to-end: Llama3 8B decoding speed at 4-bit is about 67 tokens/sec with flute vs. 97 tokens/sec with torchao/bitblas (group-size=64, batch-size=1).
The quality tends to be better with LUT than with linear quantization though, as expected, since linear quantization is just a special case of LUT. Linear quantization should run faster since there's no cost to read the LUT from shared memory.
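To make the "special case" point concrete: a linear (affine) dequantizer w = scale * (code - zero) can always be rewritten as a lookup table indexed by the code, so a LUT kernel subsumes it; the extra cost of the LUT kernel is just fetching that table. A toy PyTorch illustration (numbers are arbitrary):

```python
import torch

bits, scale, zero = 3, 0.05, 4                 # toy linear-quantization parameters
codes = torch.randint(0, 2**bits, (8,))        # quantized weight codes

# Linear dequantization: w = scale * (code - zero)
w_linear = scale * (codes - zero).float()

# The same dequantization expressed as a LUT over all 2**bits possible codes
lut = scale * (torch.arange(2**bits) - zero).float()
w_lut = lut[codes]

assert torch.equal(w_linear, w_lut)
```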

@LeiWang1999 (Contributor)

@mobicham, got it! Thanks for sharing.


brisker commented Aug 1, 2024

@LeiWang1999
Are there any benchmark speed tests for w4a8 compared to fp16?


LeiWang1999 commented Aug 1, 2024

@brisker, we provide a benchmark script for the BitBLAS matmul:

https://github.com/microsoft/BitBLAS/blob/main/benchmark/operators/benchmark_bitblas_matmul.py
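For anyone who just wants a single decode-shaped data point rather than the full sweep, here is a rough sketch of timing one BitBLAS W4A16 GEMV with CUDA events. The MatmulConfig / Matmul / transform_weight usage is adapted from the BitBLAS README quickstart; exact parameter names and dtype strings may differ between releases, and the shapes are illustrative.

```python
import torch
import bitblas

# Decode-shaped GEMV: batch size 1, Llama-like hidden size (illustrative numbers).
cfg = bitblas.MatmulConfig(
    M=1, N=4096, K=4096,
    A_dtype="float16", W_dtype="int4",
    accum_dtype="float16", out_dtype="float16",
    layout="nt",
)
matmul = bitblas.Matmul(config=cfg)

A = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
W = torch.randint(0, 7, (4096, 4096), dtype=torch.int8, device="cuda")
Wq = matmul.transform_weight(W)             # repack int8 codes into the 4-bit layout

for _ in range(10):                         # warmup
    matmul(A, Wq)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    matmul(A, Wq)
end.record()
torch.cuda.synchronize()
print(f"avg W4A16 GEMV latency: {start.elapsed_time(end) / 100:.3f} ms")
```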


brisker commented Aug 5, 2024

@LeiWang1999

1. In the link you provided, I noticed that you compared bitblas-w4a16 with marlin-w4a16. I want to ask: are they both tested with per-channel w4 quantization (i.e., with no group-wise weight-quantization tricks)?

2. Is the w4a8 quantization pipeline integrated into vLLM yet?

LeiWang1999 added the "enhancement" (New feature or request) label on Aug 14, 2024