No reduction in model size #15
Comments
Hello, the weights obtained this way are the calibrated fake-quantized weights. To achieve actual weight compression, a packing step is required when storing the weights. For example, with INT2 quantization, the packing step stores 16 INT2 values in a single INT32.
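For illustration only, here is a minimal sketch of that packing idea (not the repository's actual packing code): 16 INT2 values, held as small integers, are OR-ed into one INT32 word, so the stored tensor has 16x fewer elements.

```python
import numpy as np
import torch

def pack_int2(qweight: torch.Tensor) -> torch.Tensor:
    """Pack INT2 values (integers in [0, 3]) into INT32 words, 16 per word."""
    q = qweight.cpu().numpy().astype(np.uint32)
    assert q.shape[-1] % 16 == 0, "last dim must be a multiple of 16"
    q = q.reshape(*q.shape[:-1], -1, 16)          # group 16 values per word
    packed = np.zeros(q.shape[:-1], dtype=np.uint32)
    for i in range(16):
        packed |= (q[..., i] & 0x3) << (2 * i)    # place each 2-bit value in its slot
    # reinterpret the bits as int32 so the result can live in a torch tensor
    return torch.from_numpy(packed.view(np.int32))

w_q = torch.randint(0, 4, (4096, 4096))          # fake-quantized INT2 weights as integers
print(pack_int2(w_q).shape)                       # torch.Size([4096, 256])
```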
We are developing a complete pipeline from the pseudo-quantized model to real packed weights, with WxAy quantized inference executed directly in Torch; it is expected to be released within a week after the National Day holiday. We did not release this pipeline earlier because the engine inside ByteDance is a pure C++ solution.
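As a rough illustration of what a Torch-only WxAy path can look like (not the ABQ-LLM kernels; all names and tensor layouts below are assumptions), a W2A8 linear layer could unpack the INT2 weights from the INT32 words, dequantize them with per-channel scales, and run the matmul against INT8-quantized activations:

```python
import torch

def unpack_int2(packed: torch.Tensor, in_features: int) -> torch.Tensor:
    """Inverse of the packing sketch above: recover INT2 values from INT32 words."""
    shifts = torch.arange(16, dtype=torch.int32, device=packed.device) * 2
    q = (packed.unsqueeze(-1) >> shifts) & 0x3          # (..., n_words, 16)
    return q.reshape(*packed.shape[:-1], -1)[..., :in_features]

def w2a8_linear(x_fp16, packed_w, w_scale, w_zero, a_scale):
    """Hypothetical W2A8 linear layer executed purely in Torch.

    packed_w: (out, in/16) int32; w_scale / w_zero: per-output-channel, shape (out, 1);
    a_scale: scalar activation scale.
    """
    # quantize activations to INT8 range, then dequantize (simulating A8)
    x_q = torch.clamp((x_fp16 / a_scale).round(), -128, 127)
    x = x_q * a_scale
    # unpack and dequantize the INT2 weights on the fly
    w_q = unpack_int2(packed_w, x.shape[-1]).to(x.dtype)
    w = (w_q - w_zero) * w_scale
    return x @ w.t()
```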
Thank you for your work. May I ask when you are planning to release the code for real weight packing and Torch inference?
It is expected this quarter.
@lswzjuer
I used this command to quantize the llama2-7b-chat model, but the model size doesn't change:
CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --model /mnt/home/model/llama2-7b-chat-hf \
    --epochs 20 --output_dir ./log/llama2-7b-w2a8 \
    --eval_ppl --wbits 2 --abits 8 --lwc --let \
    --tasks piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande \
    --real_quant \
    --save_dir /mnt/home/model/abq-llm/llama2-7b-w2a8
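One way to see what the run actually saved is to load the checkpoint and inspect tensor dtypes and total size; if the tensors are still FP16 fake-quantized weights rather than packed integers, the on-disk size will be unchanged. This is a hypothetical check and the checkpoint file name below is an assumption, not something confirmed by the repo:

```python
import torch

# Assumed file name under --save_dir; adjust to whatever the run actually wrote.
sd = torch.load("/mnt/home/model/abq-llm/llama2-7b-w2a8/pytorch_model.bin",
                map_location="cpu")
total_bytes = sum(t.numel() * t.element_size()
                  for t in sd.values() if torch.is_tensor(t))
print(f"total tensor storage: {total_bytes / 1e9:.2f} GB")
for name, t in list(sd.items())[:5]:
    print(name, t.dtype, tuple(t.shape))   # FP16 here means weights are not packed
```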