Cannot quantize BGE ONNX model (embedding model) without performance loss #2145

Open
chuangzhidan opened this issue Jan 2, 2025 · 0 comments
Labels
bug Something isn't working
chuangzhidan commented Jan 2, 2025

System Info

Ubuntu
Python 3.12.4
optimum version: 1.22.0
A800 GPU

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import time

def load_and_infer_onnx_model(save_dir, sentences):
    tokenizer = AutoTokenizer.from_pretrained(save_dir)
    model_ort = ORTModelForFeatureExtraction.from_pretrained(save_dir, file_name="model.onnx")
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    start_time = time.time()
    with torch.no_grad():
        model_output_ort = model_ort(**encoded_input)
    end_time = time.time()
    print(f"Inference time: {end_time - start_time:.6f} seconds")
    return model_output_ort['last_hidden_state']

save_dir = '/media/data/llm/bge-large-zh-onnx'
sentences = ["样例数据-1", "样例数据-2"]  # or any other sentences
start_time = time.time()
output = load_and_infer_onnx_model(save_dir, sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and inference time: {end_time - start_time:.6f} seconds")

from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the already-exported ONNX model and set up dynamic int8 quantization
quantizer = ORTQuantizer.from_pretrained(save_dir)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model
start_time = time.time()
model_quantized_path = quantizer.quantize(
    save_dir="/media/data/llm/bge-large-zh-quant",
    quantization_config=dqconfig,
    # model_file_name="model.onnx",
    file_suffix=""
)
end_time = time.time()
print(f"done! time spent: {end_time - start_time:.6f} seconds")

start_time = time.time()
output = load_and_infer_onnx_model("/media/data/llm/bge-large-zh-quant/", sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and inference time: {end_time - start_time:.6f} seconds")

This is the quantized ONNX model's output. The quantized model is about 1/4 the size of the original ONNX model and roughly twice as fast at inference, but its logits do not match the original's:
Inference output: tensor([[[ 0.0172,  0.3434, -0.6531,  ..., -0.5857,  0.3458, -0.5848],
         [ 0.6218,  0.8631, -0.9589,  ..., -0.2100,  0.4544, -1.0890],
         [ 0.3047, -0.0212, -0.9248,  ..., -0.1990,  0.7436, -0.6348],
         ...,
         [ 0.2699,  0.2679, -0.9100,  ..., -0.0953,  0.0020, -1.3923],
         [-0.0181,  0.1850, -0.9733,  ..., -0.0529,  0.3307, -0.8774],
         [-0.0032,  0.3302, -0.6237,  ..., -0.5871,  0.4097, -0.6030]],

        [[ 0.2260,  0.2415, -0.2126,  ..., -0.4574,  0.8882, -0.4670],
         [ 1.0756,  0.7875, -0.5661,  ..., -0.0712,  0.4537, -0.8165],
         [ 0.6386,  0.0840, -0.6347,  ..., -0.0147,  0.8415, -0.4670],
         ...,
         [ 0.5548,  0.1655, -0.7804,  ...,  0.0193,  0.3028, -0.9989],
         [ 0.9167,  0.4904, -1.2653,  ..., -0.2235,  0.1046, -1.2234],
         [ 0.2652,  0.2070, -0.2308,  ..., -0.4738,  0.8749, -0.4810]]])

total load and inference time: 1.214806 seconds
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Inference time: 0.060446 seconds
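
As a side note, the truncation warning above appears because truncation=True is passed without a max_length and the tokenizer config defines no model_max_length. A hedged fix is to pass an explicit limit (512 is the usual BGE sequence limit, but verify against the model card):

encoded_input = tokenizer(sentences, padding=True, truncation=True,
                          max_length=512, return_tensors='pt')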

Original BGE ONNX model logits:
Embeddings: tensor([[[ 0.0307,  0.3419, -0.5840,  ..., -0.6412,  0.6155, -0.6777],
         [ 0.6904,  0.7553, -0.9444,  ..., -0.2511,  0.5448, -1.1094],
         [ 0.4891, -0.0739, -0.9083,  ..., -0.3734,  0.8741, -0.6356],
         ...,
         [ 0.3115,  0.2182, -0.8625,  ..., -0.2178,  0.2122, -1.5642],
         [ 0.0301,  0.0755, -1.0148,  ..., -0.1433,  0.4683, -0.7805],
         [ 0.0312,  0.3422, -0.5849,  ..., -0.6419,  0.6148, -0.6777]],

        [[ 0.3169,  0.0871, -0.3294,  ..., -0.5903,  0.8581, -0.5273],
         [ 1.1735,  0.7043, -0.7917,  ..., -0.0730,  0.3665, -0.8104],
         [ 0.8842, -0.1043, -0.7464,  ..., -0.2039,  0.8116, -0.4887],
         ...,
         [ 0.6671, -0.0624, -0.8531,  ..., -0.1373,  0.4128, -1.2478],
         [ 1.0800,  0.2469, -1.4367,  ..., -0.2730,  0.0334, -1.3879],
         [ 0.3173,  0.0875, -0.3308,  ..., -0.5911,  0.8573, -0.5275]]])

Expected behavior

The quantized model's output should also look like this:
Embeddings: tensor([[[ 0.0307,  0.3419, -0.5840,  ..., -0.6412,  0.6155, -0.6777],
         [ 0.6904,  0.7553, -0.9444,  ..., -0.2511,  0.5448, -1.1094],
         [ 0.4891, -0.0739, -0.9083,  ..., -0.3734,  0.8741, -0.6356],
         ...,
         [ 0.3115,  0.2182, -0.8625,  ..., -0.2178,  0.2122, -1.5642],
         [ 0.0301,  0.0755, -1.0148,  ..., -0.1433,  0.4683, -0.7805],
         [ 0.0312,  0.3422, -0.5849,  ..., -0.6419,  0.6148, -0.6777]],

        [[ 0.3169,  0.0871, -0.3294,  ..., -0.5903,  0.8581, -0.5273],
         [ 1.1735,  0.7043, -0.7917,  ..., -0.0730,  0.3665, -0.8104],
         [ 0.8842, -0.1043, -0.7464,  ..., -0.2039,  0.8116, -0.4887],
         ...,
         [ 0.6671, -0.0624, -0.8531,  ..., -0.1373,  0.4128, -1.2478],
         [ 1.0800,  0.2469, -1.4367,  ..., -0.2730,  0.0334, -1.3879],
         [ 0.3173,  0.0875, -0.3308,  ..., -0.5911,  0.8573, -0.5275]]])

So, is there any way to use quantization to speed up inference on the already-fast ONNX model while maintaining embedding quality? Thank you!
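
One setting worth trying (a sketch under stated assumptions, not a verified fix): per-channel weight quantization often reduces int8 accuracy loss for transformer encoders, at the same model size and similar speed. The snippet below only flips per_channel=True relative to the reproduction script; the output directory name is made up for illustration, and the effect should be measured with a cosine-similarity check like the one sketched earlier.

# Same dynamic quantization as in the reproduction, but with per-channel
# weight scales, which usually track the fp32 weights more closely.
quantizer = ORTQuantizer.from_pretrained(save_dir)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.quantize(
    save_dir="/media/data/llm/bge-large-zh-quant-per-channel",  # hypothetical path
    quantization_config=dqconfig,
)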
