cannot quantize bge onnx model (embedding model) without performance loss
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import time

def load_and_infer_onnx_model(save_dir, sentences):
    tokenizer = AutoTokenizer.from_pretrained(save_dir)
    model_ort = ORTModelForFeatureExtraction.from_pretrained(save_dir, file_name="model.onnx")
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    start_time = time.time()
    with torch.no_grad():
        model_output_ort = model_ort(**encoded_input)
    end_time = time.time()
    print(f"Inference time : {end_time - start_time:.6f} seconds")
    return model_output_ort['last_hidden_state']

save_dir = '/media/data/llm/bge-large-zh-onnx'
sentences = ["样例数据-1", "样例数据-2"]  # "sample data-1", "sample data-2" -- or any other sentences

start_time = time.time()
output = load_and_infer_onnx_model(save_dir, sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and Inference time : {end_time - start_time:.6f} seconds")
```
```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Create a quantizer from the exported ONNX model and configure dynamic quantization
quantizer = ORTQuantizer.from_pretrained(save_dir)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model
start_time = time.time()
model_quantized_path = quantizer.quantize(
    save_dir="/media/data/llm/bge-large-zh-quant",
    quantization_config=dqconfig,
    # model_file_name="model.onnx",
    file_suffix=""
)
end_time = time.time()
print(f"done! time spent: {end_time - start_time:.6f} seconds")

start_time = time.time()
output = load_and_infer_onnx_model("/media/data/llm/bge-large-zh-quant/", sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and Inference time : {end_time - start_time:.6f} seconds")
```
This is the quantized ONNX model's output. It is 1/4 the size of the original ONNX model and about twice as fast at inference, but its output logits are not the same:
```
Inference output: tensor([[[ 0.0172, 0.3434, -0.6531, ..., -0.5857, 0.3458, -0.5848],
[ 0.6218, 0.8631, -0.9589, ..., -0.2100, 0.4544, -1.0890],
[ 0.3047, -0.0212, -0.9248, ..., -0.1990, 0.7436, -0.6348],
...,
[ 0.2699, 0.2679, -0.9100, ..., -0.0953, 0.0020, -1.3923],
[-0.0181, 0.1850, -0.9733, ..., -0.0529, 0.3307, -0.8774],
[-0.0032, 0.3302, -0.6237, ..., -0.5871, 0.4097, -0.6030]],

total load and Inference time : 1.214806 seconds
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Inference time: 0.060446 seconds
```
The original BGE ONNX model's logits, for comparison:
```
Embeddings: tensor([[[ 0.0307, 0.3419, -0.5840, ..., -0.6412, 0.6155, -0.6777],
[ 0.6904, 0.7553, -0.9444, ..., -0.2511, 0.5448, -1.1094],
[ 0.4891, -0.0739, -0.9083, ..., -0.3734, 0.8741, -0.6356],
...,
[ 0.3115, 0.2182, -0.8625, ..., -0.2178, 0.2122, -1.5642],
[ 0.0301, 0.0755, -1.0148, ..., -0.1433, 0.4683, -0.7805],
[ 0.0312, 0.3422, -0.5849, ..., -0.6419, 0.6148, -0.6777]],
```
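For an embedding model, some drift in the raw hidden states is expected under int8; a more meaningful check is the cosine similarity between the pooled, normalized embeddings of the two models. A minimal sketch, assuming the [CLS]-token pooling that BGE models typically use (the `embed` helper below is hypothetical, not part of the original script):

```python
import torch.nn.functional as F

def embed(last_hidden_state):
    # BGE-style sentence embedding: take the [CLS] token (position 0),
    # then L2-normalize.
    return F.normalize(last_hidden_state[:, 0], p=2, dim=1)

emb_fp32 = embed(load_and_infer_onnx_model(save_dir, sentences))
emb_int8 = embed(load_and_infer_onnx_model("/media/data/llm/bge-large-zh-quant/", sentences))

# Per-sentence cosine similarity; values close to 1.0 mean the quantized model
# preserves the embedding geometry even though the raw logits differ.
print((emb_fp32 * emb_int8).sum(dim=1))
```

If these similarities stay very close to 1.0, downstream retrieval quality is usually unaffected even when the raw tensors look different.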
Expected behavior
So, is there any way to speed up inference of an already-fast ONNX model with quantization while maintaining performance? Thank you!