
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation #6036

Open · LeiWang1999 wants to merge 46 commits into main
Conversation

@LeiWang1999 commented Jul 1, 2024

Hi all, this PR introduces support for the Microsoft Runtime Kernel Library (BitBLAS) to enhance our low-precision computation capabilities.

Brief Introduction to BitBLAS

BitBLAS is a library to support mixed-precision BLAS operations on GPUs, for example, the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication where $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
BitBLAS aims to support efficient mixed-precision DNN model deployment, especially the $W_{wdtype}A_{adtype}$ quantization in large language models (LLMs), for example, the $W_{UINT4}A_{FP16}$ in GPTQ, the $W_{INT2}A_{FP16}$ in BitDistiller, the $W_{INT2}A_{INT8}$ in BitNet-b1.58.
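
For readers less familiar with the notation, the sketch below spells out the semantics that a $W_{UINT4}A_{FP16}$ kernel computes, as a plain PyTorch reference (dequantize, then matmul). It is purely illustrative: the shapes, group size, zero point, and fp32 accumulation are simplifying assumptions, and a real BitBLAS kernel fuses dequantization into the GEMM instead of materializing the fp16 weight.

```python
import torch

# Reference semantics of C[M, N] = A_fp16[M, K] x W_uint4[N, K]^T.
M, N, K, group_size = 16, 4096, 4096, 128  # assumed shapes

A = torch.randn(M, K).half()                             # fp16 activations
W_q = torch.randint(0, 16, (N, K), dtype=torch.uint8)    # 4-bit weights, stored unpacked here
scales = torch.rand(N, K // group_size).half()           # per-group scales
zeros = torch.full((N, K // group_size), 8.0, dtype=torch.float16)  # assumed zero point

# Dequantize per group: w = (q - zero) * scale.
scales_full = scales.repeat_interleave(group_size, dim=1)
zeros_full = zeros.repeat_interleave(group_size, dim=1)
W = (W_q.half() - zeros_full) * scales_full

# Accumulate in fp32 for the CPU reference, then cast the output back to fp16.
C = (A.float() @ W.float().T).to(torch.float16)
print(C.shape)  # torch.Size([16, 4096])
```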

PR Overview

This PR integrates BitBLAS into vLLM and adds examples of its usage. We provide two forms (a usage sketch follows the list):

  1. Load from GPTQ checkpoints: models stored in the GPTQ format are loaded and repacked into the BitBLAS layout at load time.
  2. Load from BitBLAS-format checkpoints: models already converted to the BitBLAS format are loaded directly for further optimized performance.
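
For concreteness, here is a minimal sketch of the two forms using vLLM's offline `LLM` API. The checkpoint names are taken from the benchmarks later in this thread, and the `quantization="bitblas"` flag is the one wired up by this PR; treat this as illustrative rather than the final user-facing interface.

```python
from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=0.0, max_tokens=64)
prompt = ["Hi, tell me about Microsoft?"]

# Form 1: GPTQ-format checkpoint, repacked into the BitBLAS layout at load time.
gptq_llm = LLM(model="hxbgsyxh/llama-13b-4bit-g-1", quantization="bitblas")
print(gptq_llm.generate(prompt, sampling)[0].outputs[0].text)

# Form 2: checkpoint already stored in the BitBLAS format
# (run separately to avoid holding two 13B models in GPU memory at once).
bitblas_llm = LLM(model="hxbgsyxh/llama-13b-4bit-g-1-bitblas", quantization="bitblas")
print(bitblas_llm.generate(prompt, sampling)[0].outputs[0].text)
```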

Below are the benchmarking results that we evaluated several months ago:

TODO ITEMS

  • Update and provide the latest benchmarking results.
  • Support the 1.58-bit (BitNet) model.
  • Provide benchmark/test scripts.

Any feedback and suggestions to improve this integration are appreciated.

@robertgshaw2-neuralmagic (Collaborator) commented:
Nice!

@LeiWang1999 (Author) commented Jul 1, 2024

BTW, are there any tools available that can automatically resolve lint issues?

vllm/model_executor/layers/quantization/gptq_bitblas.py:28:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:28:8: F811 Redefinition of unused `bitblas` from line 21
vllm/model_executor/layers/quantization/gptq_bitblas.py:29:1: E402 Module level import not at top of file
vllm/model_executor/layers/quantization/gptq_bitblas.py:66:81: E501 Line too long (107 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:172:81: E501 Line too long (85 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:222:81: E501 Line too long (105 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:230:81: E501 Line too long (89 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:233:81: E501 Line too long (110 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:236:81: E501 Line too long (99 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:242:81: E501 Line too long (84 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:253:81: E501 Line too long (94 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:414:81: E501 Line too long (86 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:417:29: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:420:17: G004 Logging statement uses f-string
vllm/model_executor/layers/quantization/gptq_bitblas.py:427:81: E501 Line too long (103 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:433:81: E501 Line too long (116 > 80)
vllm/model_executor/layers/quantization/gptq_bitblas.py:454:81: E501 Line too long (82 > 80)

@robertgshaw2-neuralmagic (Collaborator) commented:
> BTW, are there any tools available that can automatically resolve lint issues?

./format.sh fixes whatever it can, but not everything can be fixed automatically (especially line length).

@mgoin (Collaborator) commented Jul 1, 2024

@LeiWang1999 thanks for the WIP, very cool interface with bitblas as a package. Can you explain whether the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or with the "gptq_marlin" interface that takes advantage of the Marlin kernels? This will be important for comparison with the current baseline we use for GPTQ models in vLLM.

@LeiWang1999 (Author) commented:
> Can you explain whether the GPTQ benchmarking results in vLLM were run with the base "gptq" kernels or with the "gptq_marlin" interface that takes advantage of the Marlin kernels?

Thanks! The benchmarks at that time used the exllamav2 kernel; we will examine the comparison with the Marlin kernel.

@LeiWang1999 (Author) commented Jul 19, 2024

Hi all, I recently updated this PR with support for the 1.58-bit model and the related BitBLAS inference kernel for vLLM.

Throughput in tokens per second (tok/s):

| Model | Framework | BS16, IN32, OUT128 | BS1, IN512, OUT1024 | BS32, IN32, OUT128 |
| --- | --- | --- | --- | --- |
| bitnet-3b-1.58bits | pytorch | 106.83 | 49.34 | 209.03 |
| bitnet-3b-1.58bits | pytorch-bitblas | 240.33 | 103.09 | 493.31 |
| bitnet-3b-1.58bits | vllm-bitblas | 379.25 | 117.43 | 752.55 |
| bitnet-3b-1.58bits | vllm-bitblas-cuda-graph | 2543.58 | 1621.08 | 2731.79 |

@LeiWang1999 marked this pull request as ready for review July 19, 2024 04:23
@LeiWang1999 (Author) commented:
We will benchmark against Marlin soon. Also, it looks like the docs build failed because of the bitblas dependency. Do you have any ideas on how to fix this? Should we add bitblas to the docs requirements, or is there an option to skip this dependency? @mgoin

@LeiWang1999 (Author) commented Aug 20, 2024

I think this PR is ready for review. Here is a summary of this update:

We now support BitBLAS as a quantized backend and can use vLLM to serve pretrained models from Hugging Face (in GPTQ, BitNet, or BitBLAS format) with the BitBLAS inference kernel.

We briefly compared performance against Marlin using the throughput benchmark script provided by vLLM, on an A100:

```bash
python benchmark_throughput.py --backend vllm --num-prompts 1 --input-len 32 --output-len 512 --max-model-len 1024 --model "hxbgsyxh/llama-13b-4bit-g-1-bitblas" --quantization "bitblas"

python benchmark_throughput.py --backend vllm --num-prompts 1 --input-len 32 --output-len 512 --max-model-len 1024 --model "hxbgsyxh/llama-13b-4bit-g-1-marlin" --quantization "marlin"
```

The performance results are:

  • Marlin: 122.67 toks/s
  • BitBLAS: 127.11 toks/s

Some notes:

  • Marlin requires a workspace buffer (used as a spin lock) to perform its global reduction, while BitBLAS does not.
  • BitBLAS supports more complex cases compared to Marlin, such as sym=False or 2-bit formats.

This PR also adds support for the 1.58-bit BitNet model.

Throughput in tokens per second (tok/s):

| Model | Framework | BS16, IN32, OUT128 | BS1, IN512, OUT1024 | BS32, IN32, OUT128 |
| --- | --- | --- | --- | --- |
| bitnet-3b-1.58bits | PyTorch | 106.83 | 49.34 | 209.03 |
| bitnet-3b-1.58bits | PyTorch-BitBLAS | 240.33 | 103.09 | 493.31 |
| bitnet-3b-1.58bits | vLLM-BitBLAS | 379.25 | 117.43 | 752.55 |
| bitnet-3b-1.58bits | vLLM-BitBLAS-CUDA-Graph | 2543.58 | 1621.08 | 2731.79 |

All correctness checks have been evaluated with the following:

```python
from conftest import VllmRunner
import torch

# BitNet model loaded from its original checkpoint via the bitnet_bitblas backend.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B",
    dtype="half",
    quantization="bitnet_bitblas",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])

# The same BitNet model already converted to the BitBLAS format.
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitblas_model:
    torch.cuda.profiler.start()
    bitblas_outputs = bitblas_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(bitblas_outputs[0][0])
    print(bitblas_outputs[0][1])

# GPTQ-quantized model with the default gptq backend, as a reference.
with VllmRunner(
    "hxbgsyxh/opt-125m-4bit-128g",
    dtype="half",
    quantization="gptq",
    enforce_eager=True,
) as gptq_model:
    torch.cuda.profiler.start()
    gptq_outputs = gptq_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("gptq:")
    print(gptq_outputs[0][0])
    print(gptq_outputs[0][1])

torch.compiler.reset()

# The same GPTQ checkpoint converted to the BitBLAS format.
with VllmRunner(
    "hxbgsyxh/opt-125m-4bit-128g-bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as gptq_bitblas_model:
    torch.cuda.profiler.start()
    gptq_bitblas_outputs = gptq_bitblas_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(gptq_bitblas_outputs[0][0])
    print(gptq_bitblas_outputs[0][1])
```

@LeiWang1999 (Author) commented:
Any questions are welcome; please take a look when you have a moment :) @mgoin @robertgshaw2-neuralmagic

@mgoin (Collaborator) commented Aug 20, 2024

Thanks for all the work @LeiWang1999! I have a few high-level thoughts first on how to make landing this more straightforward:

  1. Make bitblas an optional dependency and remove it from requirements-common.txt. The PyPI package is ~90 MB, is possibly built for a specific version of PyTorch/CUDA, and seems to pull in a lot of dependencies (TVM?). I think it is hard to require, especially for non-CUDA devices. See bitsandbytes or deepspeedfp for an example of how we usually implement a lazy import with an exception message telling the user to install it (a sketch of that pattern follows this list).
  2. Add support for bitnet in another PR. I think it is worth looking at separately and understanding the pros/cons. I find it a bit surprising that it requires implementing a whole new model and tokenizer class.
  3. gptq_bitblas seems a bit redundant without further benchmarking separating it from gptq_marlin. I agree it is certainly useful where marlin doesn't have support. Also we want to eventually perform a refactor to separate checkpoint formats from kernel implementations, so we will need to revisit this soon.
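
For reference, here is a minimal sketch of the lazy-import-with-install-hint pattern being referred to, modeled on how vLLM guards other optional backends such as bitsandbytes; the function name and error message are illustrative, not the final code:

```python
def _check_bitblas_available():
    """Import bitblas lazily and fail with an actionable message if it is missing."""
    try:
        import bitblas  # heavy optional dependency, only needed for this backend
    except ImportError as err:
        raise ImportError(
            "The BitBLAS quantization backend requires the `bitblas` package. "
            "Install it with `pip install bitblas`."
        ) from err
    return bitblas
```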

@LeiWang1999 (Author) commented:
Thanks for your suggestions, @mgoin!

  1. For the first suggestion, we’ve restructured the bitblas import to be a lazy import. Additionally, while bitblas still requires a CUDA environment, it is now linked to libcuda.so instead of being tied to a specific CUDA version :).
  2. Regarding the second suggestion, we've removed the BitNet-related items from this PR. Let's continue that discussion in a separate PR.
  3. For the third suggestion, I believe keeping gptq_bitblas is still valuable for formats that Marlin doesn't support, such as those with dynamic zero points or lower precision. It allows us to directly repack a GPTQ checkpoint into the BitBLAS format, for example:
```python
with VllmRunner(
    "hxbgsyxh/llama-13b-4bit-g-1",  # model stored in GPTQ format
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitblas_model:
    torch.cuda.profiler.start()
    outputs = bitblas_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    torch.cuda.profiler.stop()
    print("bitblas:")
    print(outputs[0][0])
    print(outputs[0][1])
```

@mgoin (Collaborator) left a review:
Thanks for splitting it up! I left a first round of clear nits/issues and will do a more in-depth pass later. There seem to be a lot of unrelated formatting changes, for some reason.

Review comments (outdated, resolved) on: docs/source/quantization/bitblas.rst, vllm/config.py, vllm/model_executor/layers/quantization/bitblas.py
@LeiWang1999 (Author) commented:
Hi @mgoin , are there any further updates or actions we should take?

@mgoin (Collaborator) left a review:

Hi @LeiWang1999 I'm very sorry for the delay, I lost track of this PR and didn't catch your ping.

There has been an ongoing refactor for quantization methods to use a new set of vLLMParameters (see gptq_marlin PR #7281) to simplify weight loading, but we could delay this for bitblas to make it easier to land this initial PR.

Also as mentioned in #7725 (comment), there will be a few merge conflicts with main.

If/when you have bandwidth to finish this out, I promise to get this over the line asap. Please let me know!

Comment on lines +498 to +520
```python
if layer.bitblas_state == GPTQBitBLASState.REPACK:
    layer.bitblas_state = GPTQBitBLASState.READY

    # Newly generated tensors need to replace existing tensors that are
    # already registered as parameters by vLLM (and won't be freed)
    def replace_tensor(name, new_t):
        # It is important to use copy_() here since it ensures
        # the same buffer is reused
        getattr(layer, name).copy_(
            new_t.view(getattr(layer, name).dtype).view(
                getattr(layer, name).shape))
        del new_t

    # Repack weights
    bitblas_qweight, bitblas_scales, bitblas_qzeros = (
        self.repack_bitblas_from_gptq(
            layer.qweight,
            layer.scales,
            layer.qzeros,
        ))
    replace_tensor("qweight", bitblas_qweight)
    replace_tensor("scales", bitblas_scales)
    replace_tensor("qzeros", bitblas_qzeros)
```
Collaborator:
It would be best to move this into the process_weights_after_loading function we have specifically for this purpose; see the example in gptq_marlin.py:

```python
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
```
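
For concreteness, here is a sketch of what that relocation might look like; illustrative only, reusing the names from the snippet quoted above (GPTQBitBLASState, repack_bitblas_from_gptq), so the final implementation may differ:

```python
def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
    # Called by vLLM once all checkpoint weights for this layer are loaded.
    if layer.bitblas_state != GPTQBitBLASState.REPACK:
        return
    layer.bitblas_state = GPTQBitBLASState.READY

    # Repack the GPTQ tensors into the BitBLAS layout.
    qweight, scales, qzeros = self.repack_bitblas_from_gptq(
        layer.qweight, layer.scales, layer.qzeros)

    # Reuse the parameter buffers already registered with vLLM via copy_().
    for name, new_t in (("qweight", qweight), ("scales", scales),
                        ("qzeros", qzeros)):
        old = getattr(layer, name)
        old.copy_(new_t.view(old.dtype).view(old.shape))
```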

@LeiWang1999 (Author):
Thanks, I'll take a look. I'm currently working on the stream-k template in bitblas :)
