[bitsandbytes]: support read bnb pre-quantized model #5753
Conversation
Force-pushed from 69decab to 95dabaa
The changes from @thesues improve my previous work on QLoRA & BnB (#4776). It now supports models whose weights are published as bnb-quantized. It also cleans up my previous code and fixes a bug (the previous version would run into an error in scenarios such as GQA). The changes look good to me. Could you also take a look?
def generator():
def prequant_generator() -> Generator:
I'm not fond of the term "prequant" here; could it be something along the lines of "quantized_checkpoint"?
sure
# filter all weights whose suffix is not ".weight"
if not weight_name.endswith(".weight"):
    continue
# .quant_state.bitsandbytes__nf4 is not supported
What isn't supported about it? It seems like no exception is being thrown
A typo here; it should be "only quant_state.bitsandbytes__nf4 is supported". Other libraries such as HF Transformers support quant_state.bitsandbytes__fp4.
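For context, a minimal sketch of how such a filter plus quant-state collection could look, assuming the bitsandbytes serialization convention where the packed data lives under "<name>.weight" and its metadata under keys like "<name>.weight.quant_state.bitsandbytes__nf4" (function and variable names here are hypothetical, not the PR's actual code):

from typing import Dict, Tuple

import torch

NF4_SUFFIX = ".quant_state.bitsandbytes__nf4"
FP4_SUFFIX = ".quant_state.bitsandbytes__fp4"

def split_bnb_state_dict(
    state_dict: Dict[str, torch.Tensor],
) -> Tuple[Dict[str, torch.Tensor], Dict[str, torch.Tensor]]:
    """Split a bnb-serialized state dict into packed ".weight" tensors and
    their NF4 quant states. Illustrative sketch only."""
    weights: Dict[str, torch.Tensor] = {}
    quant_states: Dict[str, torch.Tensor] = {}
    for name, tensor in state_dict.items():
        if name.endswith(NF4_SUFFIX):
            # Keep the serialized quant state, keyed by the weight it belongs to.
            quant_states[name[: -len(NF4_SUFFIX)]] = tensor
        elif name.endswith(FP4_SUFFIX):
            # Only NF4 checkpoints are handled; FP4 (as written by HF
            # Transformers, for example) is rejected explicitly.
            raise ValueError("only NF4 pre-quantized checkpoints are supported")
        elif name.endswith(".weight"):
            weights[name] = tensor
        # Other metadata entries (absmax, quant_map, ...) are skipped here.
    return weights, quant_states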
qweight_iterator, quant_state_dict = (
    self._get_quantized_weights_iterator(model_config.model,
                                         model_config.revision))
pre_quant = False
Ditto on pre_quant; it should be talking about the checkpoint being quantized.
Appreciate the improvements and the ability to natively load! I think it would be great to follow up with a documentation page in the quantization section showing how to deploy bnb models directly in vLLM, perhaps straight from a quick finetune in Unsloth.
Force-pushed from fb0a6d2 to b7c3aae
Sure, I added a very simple bnb.rst in docs.
@mgoin can you review this version?
docs/source/quantization/bnb.rst (Outdated)
# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
    ,quantization="bitsandbytes", load_format="bitsandbytes")
Redundant comma.
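For reference, the corrected snippet would presumably read (stray comma dropped; imports added here so it runs standalone):

import torch
from vllm import LLM

# unsloth/tinyllama-bnb-4bit is a pre-quantized checkpoint.
model_id = "unsloth/tinyllama-bnb-4bit"
llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True,
          quantization="bitsandbytes", load_format="bitsandbytes")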
Is this PR good to merge now? It would be really great if we could get it in before the next scheduled release (#6433)!
Force-pushed from 8e891bb to bb014ef
SGTM
Is there anything I could do to improve this patch?
Any ETA for this feature to be merged? Really keen to use it.
Thanks for pinging, LGTM with a small docs fix
As I mentioned earlier, Looks Good To Me!
@thesues do you think you could resolve the new conflicts?
I've just tried installing from this PR's branch and testing it out, and this solution worked for me!
It used to produce issues like:
but now model loading seems to work great:
It would be great if someone could validate that the new Llama 3.1 8B BNB checkpoint loads: https://huggingface.co/hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4
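One way to validate it, mirroring the docs snippet above (untested sketch; the flags are the same ones shown there):

import torch
from vllm import LLM, SamplingParams

# Load the NF4 pre-quantized Llama 3.1 8B Instruct checkpoint directly.
llm = LLM(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4",
          dtype=torch.bfloat16,
          quantization="bitsandbytes",
          load_format="bitsandbytes")

outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)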
Meta-Llama-3.1-8B-Instruct-BNB-NF4 works as expected.
logs:
…el/omost-llama-3-8b-4bits
Co-authored-by: Michael Goin <[email protected]>
done
Thanks a lot for testing further and sticking with this, LGTM!
Hugging Face hosts bitsandbytes pre-quantized models such as
...
This PR adds support for loading these pre-quantized models in vLLM.
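A hedged way to check whether a given Hugging Face checkpoint is bnb pre-quantized (and thus covered by this change) is to look for a bitsandbytes quantization_config in its config.json; this is a heuristic for users, not the detection logic used by the PR itself:

import json

from huggingface_hub import hf_hub_download

def looks_bnb_prequantized(model_id: str) -> bool:
    """Heuristic: transformers-serialized bnb checkpoints declare a
    quantization_config with quant_method == "bitsandbytes" in config.json."""
    config_path = hf_hub_download(model_id, "config.json")
    with open(config_path) as f:
        config = json.load(f)
    quant_cfg = config.get("quantization_config", {})
    return quant_cfg.get("quant_method") == "bitsandbytes"

# Example: the checkpoint used in the docs snippet above.
print(looks_bnb_prequantized("unsloth/tinyllama-bnb-4bit"))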
@chenqianfzh @Yard1 @jeejeelee