I am referring to https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/language_model/llama/smooth_quant for quantizing the Llama chat model and then running inference on it. I have successfully created a quantized version; however, the response from the model is not satisfying. Below is the code snippet I am using for inference.
```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import LlamaTokenizer, pipeline

onnx_path = "./onnx_q/"

# Load the quantized ONNX model and its tokenizer, and move the model to the GPU
opt_model = ORTModelForCausalLM.from_pretrained(onnx_path, file_name="model.onnx").to("cuda")
tokenizer = LlamaTokenizer.from_pretrained(onnx_path)

# Build a text-generation pipeline around the quantized model
opt_optimum_generator = pipeline(
    "text-generation", model=opt_model, tokenizer=tokenizer, device="cuda"
)

prompt = "what is ai ?"
generated_text = opt_optimum_generator(
    prompt, max_length=254, num_return_sequences=1, truncation=True
)
print(generated_text[0]["generated_text"])
```
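As a sanity check, the same prompt could be run through the original FP32 checkpoint to see whether the quality drop comes from quantization itself rather than from the inference setup. A minimal sketch, assuming the base Hugging Face checkpoint is available (the model id below is a placeholder, not from the example repo):

```python
from transformers import AutoModelForCausalLM, LlamaTokenizer, pipeline

# Placeholder: substitute the actual base checkpoint the ONNX model was quantized from
base_id = "meta-llama/Llama-2-7b-chat-hf"

# Load the unquantized FP32 model as a baseline
fp32_model = AutoModelForCausalLM.from_pretrained(base_id).to("cuda")
fp32_tokenizer = LlamaTokenizer.from_pretrained(base_id)
fp32_generator = pipeline(
    "text-generation", model=fp32_model, tokenizer=fp32_tokenizer, device=0
)

# Generate with the same prompt and settings as the quantized run for a fair comparison
prompt = "what is ai ?"
baseline = fp32_generator(prompt, max_length=254, num_return_sequences=1, truncation=True)
print(baseline[0]["generated_text"])
```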
Am I doing something wrong?