Inference super slow #15

Open
SinanAkkoyun opened this issue May 29, 2023 · 4 comments

@SinanAkkoyun

Hello, I only get maybe one token/second, whereas I get 30 tokens/second with the default PyTorch implementation (running on an H100).

@DungMinhDao

I guess you can try inference on the GPU after making some modifications to the code:

In llama/memory_pool.py:
    self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider'])

Find all the files that import onnxruntime and add import torch before it.
Also remember to uninstall onnxruntime and install onnxruntime-gpu instead.
Note: it takes 34 GB of GPU memory for me to load the model, but inference is fast.
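A minimal sketch of that change, assuming the session is created in memory_pool.py roughly as shown above (the make_session helper name is just for illustration; the CPU provider is kept as a fallback):

```python
# Hypothetical sketch of the GPU change described above, not the repo's exact code.
import torch            # imported before onnxruntime so the CUDA libraries are loaded first
import onnxruntime as ort

def make_session(onnxfile: str) -> ort.InferenceSession:
    # Prefer the CUDA provider (requires the onnxruntime-gpu package);
    # CPUExecutionProvider stays as a fallback if CUDA is unavailable.
    providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    return ort.InferenceSession(onnxfile, providers=providers)
```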

@SinanAkkoyun
Author

I am struggling to get it to run; did you already get it running? Could you please tell me how many tokens/second you get out of the 7B or 13B model? Thank you so much!

@DungMinhDao

> I am struggling to get it to run; did you already get it running? Could you please tell me how many tokens/second you get out of the 7B or 13B model? Thank you so much!

I ran the 7B model downloaded from the repo's given link. About 0.2 tokens/s on CPU and 20 tokens/s on GPU.

@tpoisonooo
Owner

The 1B model needs 4 GB of memory in float32 format. It is really hard to run inference quickly on a single CPU.

If you want performance on a mobile/laptop CPU, try the InferLLM repo: https://github.com/MegEngine/InferLLM
For model conversion to NPU/DSP, use llama.onnx.
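As a rough sanity check on those numbers (my own back-of-the-envelope sketch, not from the repo): float32 weights take 4 bytes per parameter, so weight memory alone is roughly parameters × 4 bytes, before any runtime overhead or activations.

```python
# Back-of-the-envelope weight memory for a float32 model: 4 bytes per parameter.
def fp32_weight_gib(n_params: float) -> float:
    return n_params * 4 / 1024**3

print(round(fp32_weight_gib(1e9), 1))   # ~3.7 GiB for a 1B-parameter model
print(round(fp32_weight_gib(7e9), 1))   # ~26.1 GiB for 7B; runtime overhead pushes this toward the 34 GB reported above
```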
