LMDeploy provides functions for quantization and inference of large language models using 8-bit integers.
Before starting inference, ensure that lmdeploy and openai/triton are correctly installed. Execute the following commands to install these:
pip install lmdeploy
pip install "triton>=2.1.0"
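To confirm that both packages are visible to the interpreter you plan to use, a small check along these lines may help. It is a minimal sketch that relies only on the standard library's importlib.metadata; the helper names (parse_version, meets_minimum, check_package) are hypothetical and not part of LMDeploy, and the version parsing is deliberately simplistic (it ignores pre-release tags and local suffixes).

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.1.0' or '2.1.0+cu121' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_package(name, minimum=None):
    """Return (ok, installed_version); the version is None if the package is missing."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return False, None
    ok = minimum is None or meets_minimum(installed, minimum)
    return ok, installed


if __name__ == "__main__":
    # The 2.1.0 minimum for triton comes from the install command above.
    for name, minimum in (("lmdeploy", None), ("triton", "2.1.0")):
        ok, installed = check_package(name, minimum)
        print(f"{name}: installed={installed} ok={ok}")
```

If either line reports ok=False, rerun the corresponding pip command in the same environment.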
To run inference with 8-bit weights, you can download a pre-quantized model directly from LMDeploy's model zoo. For instance, the 8-bit Internlm-chat-7B model can be fetched as follows:
git-lfs install
git clone https://huggingface.co/lmdeploy/internlm-chat-7b-w8  # coming soon
Alternatively, you can convert the original 16-bit weights to 8-bit yourself, as described in the "8bit Weight Quantization" section. The command below saves the converted weights in the internlm-chat-7b-w8 directory:
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
Afterwards, use the following command to interact with the model via the terminal:
lmdeploy chat torch ./internlm-chat-7b-w8
Coming soon...
Performing 8bit weight quantization involves three steps:
- Smooth Weights: Start by smoothing the weights of the large language model (LLM). This makes the weights easier to quantize.
- Replace Modules: Locate the DecoderLayers and replace the RMSNorm and nn.Linear modules with QRMSNorm and QLinear modules respectively. These 'Q' modules are defined in the lmdeploy/pytorch/models/q_modules.py file.
- Save the Quantized Model: Once you've made the necessary replacements, save the new quantized model.
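To make the smoothing step and the 8-bit rounding more concrete, here is a toy, pure-Python sketch. It assumes the SmoothQuant-style rule (per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha)) followed by symmetric per-row int8 quantization; whether LMDeploy uses exactly this rule, this alpha, or this granularity is an assumption, and every function name here is hypothetical. It does not reproduce QLinear or QRMSNorm.

```python
# Illustrative sketch only; this is not LMDeploy's implementation.


def smooth_scales(act_absmax, weight, alpha=0.5, eps=1e-5):
    """Per-input-channel scales s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    scales = []
    for j in range(len(weight[0])):
        w_absmax = max(abs(row[j]) for row in weight)
        s = (max(act_absmax[j], eps) ** alpha) / (max(w_absmax, eps) ** (1 - alpha))
        scales.append(s)
    return scales


def smooth_weight(weight, scales):
    """Fold the scales into the weights: W'[i][j] = W[i][j] * s_j."""
    return [[w * s for w, s in zip(row, scales)] for row in weight]


def smooth_input(x, scales):
    """Divide the activations by the same scales so the product is unchanged."""
    return [xi / s for xi, s in zip(x, scales)]


def quantize_rows_int8(weight):
    """Symmetric int8 quantization with one scale per output row."""
    q_rows, row_scales = [], []
    for row in weight:
        absmax = max(abs(w) for w in row) or 1.0  # avoid a zero scale
        scale = absmax / 127.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        row_scales.append(scale)
    return q_rows, row_scales


def dequantize_rows(q_rows, row_scales):
    """Map int8 values back to floats using the per-row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, row_scales)]


def linear(x, weight):
    """y_i = sum_j x_j * W[i][j], i.e. a bias-free linear layer."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Toy layer: the second input channel carries an activation outlier.
weight = [[0.5, -0.02], [0.25, 0.01]]
x = [0.1, 8.0]

scales = smooth_scales([abs(v) for v in x], weight)
smoothed_w = smooth_weight(weight, scales)
smoothed_x = smooth_input(x, scales)
q_rows, row_scales = quantize_rows_int8(smoothed_w)

print("reference output:", linear(x, weight))
print("smoothed output: ", linear(smoothed_x, smoothed_w))
print("int8 weights:    ", q_rows)
```

Smoothing leaves the layer output mathematically unchanged while shifting part of the activation range into the weights, so both sides fit an 8-bit grid more evenly. In a real model the activation statistics would come from calibration data rather than a single input.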
The script lmdeploy/lite/apis/smooth_quant.py accomplishes all three tasks detailed above. For example, you can obtain the model weights of the quantized Internlm-chat-7B model by running the following command:
lmdeploy lite smooth_quant internlm/internlm-chat-7b --work-dir ./internlm-chat-7b-w8
After saving, you can instantiate your quantized model by calling the from_pretrained interface.
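The text does not say which from_pretrained interface is meant. A minimal sketch, assuming it is the Hugging Face transformers AutoModelForCausalLM loader and that the saved directory ships custom modeling code (hence trust_remote_code=True, also an assumption), might look like this. The wrapper name load_quantized_model is hypothetical.

```python
def load_quantized_model(model_dir="./internlm-chat-7b-w8"):
    """Load the quantized checkpoint saved by the smooth_quant command."""
    # Imported lazily so the function can be defined without transformers installed.
    from transformers import AutoModelForCausalLM

    # Assumption: the work directory contains custom model code that must be trusted.
    return AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True)


if __name__ == "__main__":
    model = load_quantized_model()
    print(type(model).__name__)
```

Running the script requires the quantized weights to be present in the given directory and the dependencies listed earlier to be installed.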