
[bitsandbytes]: support read bnb pre-quantized model #5753

Merged (2 commits, Jul 23, 2024)

Conversation

@thesues (Contributor) commented Jun 21, 2024

Hugging Face hosts bitsandbytes pre-quantized models such as:

  • lllyasviel/omost-llama-3-8b-4bit
  • unsloth/tinyllama-bnb-4bit
    ...

This PR adds support for loading these pre-quantized models in vLLM.

@chenqianfzh @Yard1 @jeejeelee

@thesues thesues force-pushed the nf5 branch 2 times, most recently from 69decab to 95dabaa Compare June 21, 2024 23:26
@chenqianfzh (Contributor)

@Yard1 @jeejeelee @mgoin

The changes from @thesues improve my previous work on QLoRA & BnB (#4776). vLLM can now load models whose weights are published in bnb-quantized form.

Other than that, the PR also cleans up my previous code and fixes a bug (the previous version would run into errors in scenarios such as GQA).

The changes look good to me. Could you also take a look?


def generator():
def prequant_generator() -> Generator:
Member

I'm not fond of the term "prequant" here, could it be something along the lines of "quantized_checkpoint"?

Contributor Author

sure
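As a purely illustrative sketch of the suggested naming (not the PR's actual code; the function shape and helper names here are assumptions), a generator over a bnb-quantized safetensors checkpoint might look like:

from typing import Generator, Tuple

import torch
from safetensors import safe_open

def quantized_checkpoint_generator(
        ckpt_file: str) -> Generator[Tuple[str, torch.Tensor], None, None]:
    # Yield (name, tensor) pairs from a bitsandbytes-quantized checkpoint,
    # including the ".quant_state.bitsandbytes__nf4" metadata tensors.
    with safe_open(ckpt_file, framework="pt", device="cpu") as f:
        for name in f.keys():
            yield name, f.get_tensor(name)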

# filter all weights whose suffix is not ".weight"
if not weight_name.endswith(".weight"):
    continue
# .quant_state.bitsandbytes__nf4 is not supported
Member

What isn't supported about it? It seems like no exception is being thrown

Contributor Author

A typo here; it should read "only quant_state.bitsandbytes__nf4 is supported". Other libraries, such as HF Transformers, also support quant_state.bitsandbytes__fp4.
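For illustration only (not the PR's actual code), a guard along these lines captures that intent, accepting NF4 quant states and rejecting FP4 ones; the function name is hypothetical:

def check_quant_state_key(weight_name: str) -> None:
    # bnb checkpoints name quant-state tensors with a
    # ".quant_state.bitsandbytes__<format>" suffix; only NF4 is handled here.
    if weight_name.endswith(".quant_state.bitsandbytes__fp4"):
        raise NotImplementedError(
            "only bitsandbytes NF4 quant states are supported, "
            f"got FP4 quant state: {weight_name}")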

qweight_iterator, quant_state_dict = (
    self._get_quantized_weights_iterator(model_config.model,
                                         model_config.revision))
pre_quant = False
Member

Ditto on pre_quant; the name should refer to the checkpoint being quantized.

@mgoin (Member) left a comment

Appreciate the improvements and ability to natively load! I think it would be great to follow up with a documentation page in the quantization section to show how to deploy bnb models directly in vLLM, perhaps straight from a quick finetune in unsloth.

@thesues thesues force-pushed the nf5 branch 3 times, most recently from fb0a6d2 to b7c3aae Compare June 27, 2024 00:10
@thesues (Contributor Author) commented Jun 27, 2024

Appreciate the improvements and ability to natively load! I think it would be great to follow up with a documentation page in the quantization section to show how to deploy bnb models directly in vLLM, perhaps straight from a quick finetune in unsloth.

Sure, I added a very simple bnb.rst to the docs.

@thesues (Contributor Author) commented Jul 2, 2024

@mgoin can you review this version?

#unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
,quantization="bitsandbytes", load_format="bitsandbytes")
Contributor

Redundant comma.
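For reference, the same docs snippet with the stray comma and line continuation removed (assuming the example is otherwise unchanged) would be:

import torch
from vllm import LLM

# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id,
          dtype=torch.bfloat16,
          trust_remote_code=True,
          quantization="bitsandbytes",
          load_format="bitsandbytes")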

@vrdn-23 (Contributor) commented Jul 15, 2024

Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)!
cc @mgoin @QwertyJack

@thesues thesues force-pushed the nf5 branch 12 times, most recently from 8e891bb to bb014ef Compare July 15, 2024 22:29
@QwertyJack (Contributor)

Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)! cc @mgoin @QwertyJack

SGTM

@thesues (Contributor Author) commented Jul 19, 2024

Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)! cc @mgoin @QwertyJack

SGTM

Is there anything I could do to improve this patch?

@eliasecchig

Any ETA for this feature to be merged? Really keen to use it

@mgoin (Member) left a comment

Thanks for pinging, LGTM with a small docs fix

@thesues thesues requested a review from QwertyJack July 22, 2024 19:48
@QwertyJack (Contributor)

Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)! cc @mgoin @QwertyJack

SGTM

As I mentioned earlier, Looks Good To Me!

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jul 23, 2024
@njhill (Member) commented Jul 23, 2024

@thesues do you think you could resolve the new conflicts?

@Maxusmusti (Contributor)

I've just tried installing from this PR's branch and testing it out; this solution worked for me!

python3 -m vllm.entrypoints.openai.api_server --model <bnb_4bit_model> --quantization bitsandbytes --load-format bitsandbytes

This used to produce errors such as:
ValueError: Cannot find any of ['adapter_name_or_path'] in the model's quantization config.
and, once those args were added:
KeyError: 'model.layers.0.self_attn.qkv_proj.weight'

but now model loading seems to work great:

WARNING 07-23 16:35:07 config.py:218] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
...
INFO 07-23 16:35:08 loader.py:800] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 07-23 16:35:10 model_runner.py:164] Loading model weights took 3.6494 GB
INFO 07-23 16:35:11 gpu_executor.py:83] # GPU blocks: 1977, # CPU blocks: 512
...
INFO:     Started server process [15386]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000/ (Press CTRL+C to quit)
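Once the server is up, it can be queried through the OpenAI-compatible completions endpoint; a minimal sketch (the model name below is a placeholder for whatever was passed to --model):

import requests

# Query the /v1/completions endpoint of the server started above.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "<bnb_4bit_model>",
        "prompt": "The capital of United States is",
        "max_tokens": 20,
    },
)
print(resp.json()["choices"][0]["text"])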

@mgoin (Member) commented Jul 23, 2024

It would be great if someone could validate that the new Llama 3.1 8B BNB checkpoint loads: https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4

@thesues (Contributor Author) commented Jul 23, 2024

It would be great if someone could validate that the new Llama 3.1 8B BNB checkpoint loads: https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4

Meta-Llama-3.1-8B-Instruct-BNB-NF4 works as expected.

from vllm import LLM, SamplingParams


model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4"
llm = LLM(model=model_id, trust_remote_code=True, enforce_eager=True,
          quantization="bitsandbytes", load_format="bitsandbytes",
          max_model_len=4096)
sampling_params = SamplingParams(max_tokens=20)

outputs = llm.generate("The capital of United States is", sampling_params)
for output in outputs:
    print(output.outputs[0].text)

logs:

WARNING 07-23 21:01:21 config.py:246] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 07-23 21:01:21 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4', speculative_config=None, tokenizer='hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-23 21:01:22 model_runner.py:680] Starting to load model hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4...
INFO 07-23 21:01:22 loader.py:821] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 07-23 21:01:22 weight_utils.py:224] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  5.58it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.49it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.68it/s]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  5.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.48it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.66it/s]

INFO 07-23 21:01:25 model_runner.py:692] Loading model weights took 5.3424 GB
INFO 07-23 21:01:27 gpu_executor.py:102] # GPU blocks: 6603, # CPU blocks: 2048
Processed prompts: 100%|█████████████████████| 1/1 [00:01<00:00,  1.57s/it, est. speed input: 3.83 toks/s, output: 25.52 toks/s]
 Washington, D.C.

@thesues (Contributor Author) commented Jul 23, 2024

@thesues do you think you could resolve the new conflicts?

done

@mgoin (Member) left a comment

Thanks a lot for testing further and sticking with this, LGTM!

@mgoin mgoin enabled auto-merge (squash) July 23, 2024 22:28
@mgoin mgoin merged commit 87525fa into vllm-project:main Jul 23, 2024
73 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
n1hility pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 24, 2024
cduk pushed a commit to cduk/vllm-pascal that referenced this pull request Aug 6, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024