
issue in quantised model generated response #450

Open
ragesh2000 opened this issue Jul 23, 2024 · 0 comments

Comments


ragesh2000 commented Jul 23, 2024

I am referring to https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant to quantise a Llama chat model and then run inference on it. I have successfully created a quantised version; however, the response from the model is not satisfying. I have provided the code snippet that I am using for inference.

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import pipeline, LlamaTokenizer

# Load the quantised ONNX model and its tokenizer from the export directory
onnx_path = "./onnx_q/"
opt_model = ORTModelForCausalLM.from_pretrained(onnx_path, file_name="model.onnx").to('cuda')
tokenizer = LlamaTokenizer.from_pretrained(onnx_path)

# Wrap the ONNX Runtime model in a text-generation pipeline
opt_optimum_generator = pipeline("text-generation", model=opt_model, tokenizer=tokenizer, device='cuda')

prompt = "what is ai ?"
generated_text = opt_optimum_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(generated_text[0]['generated_text'])
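For comparison, below is a minimal sketch (the FP32 checkpoint path "./llama-chat-fp32/" is hypothetical) that runs the same prompt through the unquantised model, to check whether the drop in quality comes from quantisation itself rather than from the inference setup:

from transformers import AutoModelForCausalLM

# Hypothetical local path to the original (unquantised) checkpoint
fp32_path = "./llama-chat-fp32/"
fp32_model = AutoModelForCausalLM.from_pretrained(fp32_path).to('cuda')

# Same pipeline, tokenizer, prompt and generation settings as above
fp32_generator = pipeline("text-generation", model=fp32_model, tokenizer=tokenizer, device='cuda')
baseline_text = fp32_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(baseline_text[0]['generated_text'])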

Am I doing something wrong?
