
issue in quantised model generated response #450

Open
ragesh2000 opened this issue Jul 23, 2024 · 0 comments

Comments


ragesh2000 commented Jul 23, 2024

I am referring to https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant to quantise a Llama chat model and then run inference on it. I have successfully created a quantised version; however, the response from the model is not satisfying. I have provided the code snippet that I am using for inference.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import pipeline, LlamaTokenizer

# Load the quantised ONNX model and its tokenizer from the export directory
onnx_path = "./onnx_q/"
opt_model = ORTModelForCausalLM.from_pretrained(onnx_path, file_name="model.onnx").to('cuda')
tokenizer = LlamaTokenizer.from_pretrained(onnx_path)

# Wrap the ONNX Runtime model in a text-generation pipeline
opt_optimum_generator = pipeline("text-generation", model=opt_model, tokenizer=tokenizer, device='cuda')

prompt = "what is ai ?"
generated_text = opt_optimum_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(generated_text[0]['generated_text'])
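For comparison, below is a minimal sketch (the FP32 checkpoint path "./llama-chat-fp32/" is hypothetical) that runs the same prompt through the unquantised model, to check whether the drop in quality comes from quantisation itself rather than from the inference setup:

from transformers import AutoModelForCausalLM

# Hypothetical local path to the original (unquantised) checkpoint
fp32_path = "./llama-chat-fp32/"
fp32_model = AutoModelForCausalLM.from_pretrained(fp32_path).to('cuda')

# Same pipeline, tokenizer, prompt and generation settings as above
fp32_generator = pipeline("text-generation", model=fp32_model, tokenizer=tokenizer, device='cuda')
baseline_text = fp32_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(baseline_text[0]['generated_text'])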

Am I doing something wrong?
