[Model] 1.58bits BitNet Model Support #7725

LeiWang1999 · 2024-08-21T08:45:26Z

This pull request is a follow-up to PR #6036. In this PR, we introduce the BitNet model and provide an efficient inference kernel with the BitBLAS backend. Here are the performance benchmarks:

Model	Framework	BS16IN32OUT128	BS1IN512OUT1024	BS32IN32OUT128
BitNet-3B-1.58bits	PyTorch	106.83	49.34	209.03
BitNet-3B-1.58bits	PyTorch-BitBLAS	240.33	103.09	493.31
BitNet-3B-1.58bits	vLLM-BitBLAS	379.25	117.43	752.55
BitNet-3B-1.58bits	vLLM-BitBLAS-CUDA-Graph	2543.58	1621.08	2731.79
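
To make the table easier to compare, here is a small sketch that derives each framework's ratio relative to the plain PyTorch row. It assumes the figures are throughput numbers where higher is better (the table does not state a unit), and it reads the column names as (batch size, input length, output length), which is my interpretation rather than something stated in the PR. The names `results`, `workloads`, and `speedups` are illustrative only.

```python
# Figures copied from the benchmark table; the unit is not stated in the PR.
# Assumption: higher is better (throughput).
results = {
    "PyTorch": [106.83, 49.34, 209.03],
    "PyTorch-BitBLAS": [240.33, 103.09, 493.31],
    "vLLM-BitBLAS": [379.25, 117.43, 752.55],
    "vLLM-BitBLAS-CUDA-Graph": [2543.58, 1621.08, 2731.79],
}

# Assumed reading of the column names: (batch size, input length, output length).
workloads = [(16, 32, 128), (1, 512, 1024), (32, 32, 128)]


def speedups(table, baseline="PyTorch"):
    """Divide every row by the baseline row, column by column."""
    base = table[baseline]
    return {
        name: [value / ref for value, ref in zip(values, base)]
        for name, values in table.items()
    }


ratios = speedups(results)
for name, row in ratios.items():
    cells = ", ".join(
        f"bs={bs} in={n_in} out={n_out}: {ratio:.2f}x"
        for (bs, n_in, n_out), ratio in zip(workloads, row)
    )
    print(f"{name}: {cells}")
```

Ratios relative to a single baseline make it easier to see which step (the BitBLAS kernel, the vLLM runtime, or CUDA graphs) contributes most for each workload.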

To answer the question raised by @mgoin in PR #6036, I believe a new BitNet model is necessary because the open-source BitNet implementation provides a unique tokenizer and model architecture, which includes an additional RMS layer compared to LLaMA. Additionally, the BitNet integration example with llama.cpp also introduces a new model architecture (refer to: llama.cpp.pr.7931).
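
For readers unfamiliar with the two ingredients mentioned here, the following is a minimal sketch in plain Python. It is not the code from this PR or from BitBLAS. It assumes the absmean ternary scheme from the published BitNet b1.58 description (weights scaled by their mean absolute value, then rounded and clipped to {-1, 0, 1}) and a standard RMS normalization. The function names `rms_norm` and `absmean_ternary_quantize` and the `eps` defaults are hypothetical.

```python
import math


def rms_norm(x, gain=None, eps=1e-6):
    """Standard RMS normalization: x / sqrt(mean(x^2) + eps) * gain."""
    if gain is None:
        gain = [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gain)]


def absmean_ternary_quantize(weights, eps=1e-5):
    """Map a 2-D weight matrix to values in {-1, 0, 1} plus one scale.

    Assumed scheme: scale = mean(|W|) + eps, then round and clip W / scale.
    Note that Python's round() rounds halves to the nearest even integer.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat) + eps
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row] for row in weights
    ]
    return quantized, scale


# Three possible values per weight carry log2(3) bits of information,
# which is where the "1.58 bits" in the name comes from.
bits_per_weight = math.log2(3)

w_q, w_scale = absmean_ternary_quantize([[0.9, -0.05], [-1.2, 0.4]])
normed = rms_norm([3.0, 4.0])
print(w_q, round(w_scale, 4), round(bits_per_weight, 2))
```

Where exactly the extra RMS layer sits in the open-source BitNet architecture is not described in this thread, so the sketch keeps the two functions separate rather than guessing at the layer layout.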

Example Usage:

from conftest import VllmRunner

# Test the BitNet model with BitBLAS quantization
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B",
    dtype="half",
    quantization="bitnet_bitblas",
    enforce_eager=True,
    gpu_memory_utilization=0.5,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitnet_bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])

# Test another BitBLAS model
with VllmRunner(
    "hxbgsyxh/bitnet_b1_58-3B_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=True,
) as bitnet_model:
    bitnet_outputs = bitnet_model.generate_greedy(
        ["Hi, tell me about Microsoft?"], max_tokens=128
    )
    print("bitblas:")
    print(bitnet_outputs[0][0])
    print(bitnet_outputs[0][1])

PR [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS #6036 should be merged

github-actions · 2024-08-21T08:45:38Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

DarkLight1337 · 2024-09-13T17:28:55Z

Sorry for the long delay. @mgoin, can you follow up on this and the previous PR?

A quick heads-up: the locations of the model tests were changed in #7820, so please merge from main.

mergify · 2024-11-12T21:15:23Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LeiWang1999.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

LeiWang1999 added 27 commits July 1, 2024 15:37

Support Repack from GPTQ.

2be6218

chore: Remove unused input_size and output_size variables in MarlinLinearMethod constructor

b92de92

Support BitNet Model for 1.58bits.

71ea469

Lint Fix

dfa6b2f

lint fix

8d2c635

Lint Fix for line length

41bb18e

Support Loading 1.58B Model with BitBLAS Format

29ac34d

Improve performance for bitnet

7f69aef

Merge branch 'main' of https://github.com/vllm-project/vllm into bitblas-intg

01a789a

fix lm_head for gptq model refactor

a973123

linx fix

aea1f4c

handle compressed scale weight.

17128d5

lint fix

1741ed4

remove partial weight load for sw

726a1f7

apply torch compile for uncompressed weight.

68c8052

Merge branch 'main' of https://github.com/vllm-project/vllm into bitblas-intg

6eb2870

merge bug fix

52418ef

lint fix

a15ba12

fix torch compile issue

53babae

bug fix.

40a4e53

BENCHMARK SCRIPTS

d316a87

Merge branch 'main' of https://github.com/vllm-project/vllm into bitblas-intg

4d40275

Implement Test

bffc05b

lint fix

8b0972b

install bitblas by default to pass the doc gen.

8e1a7e8

hide the bitblas import

7fbbccf

import fix

c487e69

mgoin mentioned this pull request Sep 13, 2024

[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS #6036

Open

3 tasks

mgoin self-requested a review November 12, 2024 21:14

mergify bot added the ci/build label Nov 12, 2024

mergify bot added the needs-rebase label Nov 12, 2024

simon-mo requested review from DarkLight1337 and ywang96 as code owners November 26, 2024 05:49
