cannot quantize bge onnx model (embedding model) without performance loss
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch
import time

def load_and_infer_onnx_model(save_dir, sentences):
    tokenizer = AutoTokenizer.from_pretrained(save_dir)
    model_ort = ORTModelForFeatureExtraction.from_pretrained(save_dir, file_name="model.onnx")
    encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    start_time = time.time()
    with torch.no_grad():
        model_output_ort = model_ort(**encoded_input)
    end_time = time.time()
    print(f"Inference time : {end_time - start_time:.6f} seconds")
    return model_output_ort['last_hidden_state']

save_dir = '/media/data/llm/bge-large-zh-onnx'
sentences = ["样例数据-1", "样例数据-2"]  # "sample data-1", "sample data-2" -- or any other sentences

start_time = time.time()
output = load_and_infer_onnx_model(save_dir, sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and Inference time : {end_time - start_time:.6f} seconds")
```
```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Create a quantizer from the exported ONNX model and configure dynamic quantization
quantizer = ORTQuantizer.from_pretrained(save_dir)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Quantize the model
start_time = time.time()
model_quantized_path = quantizer.quantize(
    save_dir="/media/data/llm/bge-large-zh-quant",
    quantization_config=dqconfig,
    # model_file_name="model.onnx",
    file_suffix=""
)
end_time = time.time()
print(f"done! time spent: {end_time - start_time:.6f} seconds")

start_time = time.time()
output = load_and_infer_onnx_model("/media/data/llm/bge-large-zh-quant/", sentences)
end_time = time.time()
print("Inference output:", output)
print(f"total load and Inference time : {end_time - start_time:.6f} seconds")
```
This is the quantized ONNX model's output. It is 1/4 the size of the original ONNX model and about twice as fast at inference, but its output logits are not the same:
```
Inference output: tensor([[[ 0.0172, 0.3434, -0.6531, ..., -0.5857, 0.3458, -0.5848],
[ 0.6218, 0.8631, -0.9589, ..., -0.2100, 0.4544, -1.0890],
[ 0.3047, -0.0212, -0.9248, ..., -0.1990, 0.7436, -0.6348],
...,
[ 0.2699, 0.2679, -0.9100, ..., -0.0953, 0.0020, -1.3923],
[-0.0181, 0.1850, -0.9733, ..., -0.0529, 0.3307, -0.8774],
[-0.0032, 0.3302, -0.6237, ..., -0.5871, 0.4097, -0.6030]],

total load and Inference time : 1.214806 seconds
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Inference time: 0.060446 seconds
```
The original BGE ONNX model's logits, for comparison:
```
Embeddings: tensor([[[ 0.0307, 0.3419, -0.5840, ..., -0.6412, 0.6155, -0.6777],
[ 0.6904, 0.7553, -0.9444, ..., -0.2511, 0.5448, -1.1094],
[ 0.4891, -0.0739, -0.9083, ..., -0.3734, 0.8741, -0.6356],
...,
[ 0.3115, 0.2182, -0.8625, ..., -0.2178, 0.2122, -1.5642],
[ 0.0301, 0.0755, -1.0148, ..., -0.1433, 0.4683, -0.7805],
[ 0.0312, 0.3422, -0.5849, ..., -0.6419, 0.6148, -0.6777]],
```
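For an embedding model, some drift in the raw hidden states is expected under int8; a more meaningful check is the cosine similarity between the pooled, normalized embeddings of the two models. A minimal sketch, assuming the [CLS]-token pooling that BGE models typically use (the `embed` helper below is hypothetical, not part of the original script):

```python
import torch.nn.functional as F

def embed(last_hidden_state):
    # BGE-style sentence embedding: take the [CLS] token (position 0),
    # then L2-normalize.
    return F.normalize(last_hidden_state[:, 0], p=2, dim=1)

emb_fp32 = embed(load_and_infer_onnx_model(save_dir, sentences))
emb_int8 = embed(load_and_infer_onnx_model("/media/data/llm/bge-large-zh-quant/", sentences))

# Per-sentence cosine similarity; values close to 1.0 mean the quantized model
# preserves the embedding geometry even though the raw logits differ.
print((emb_fp32 * emb_int8).sum(dim=1))
```

If these similarities stay very close to 1.0, downstream retrieval quality is usually unaffected even when the raw tensors look different.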
Expected behavior
So, is there any way to speed up inference of an already-fast ONNX model with quantization while maintaining performance? Thank you!