Inference super slow #15
Comments
I guess you can try inference with GPU after making some modifications to the code. In llama/memory_pool.py: self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider']). Find all the files that create an ort.InferenceSession and add the provider there as well.
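A minimal sketch of that change, assuming onnxruntime-gpu is installed and `onnxfile` points at the exported LLaMA ONNX graph (the path and variable names here are illustrative, not the repo's exact code):

```python
import onnxruntime as ort

onnxfile = "llama_7b.onnx"  # hypothetical path to the exported model

# Request the CUDA provider first and fall back to CPU if it is unavailable.
sess = ort.InferenceSession(
    onnxfile,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Confirms which provider was actually loaded.
print(sess.get_providers())
```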
I am struggling to get it to run. Did you already make it run? Could you please tell me how many tokens/second you get out of the 7B or 13B model? Thank you so much!
I ran the 7B model downloaded from the link given in the repo. About 0.2 tokens/s on CPU and 20 tokens/s on GPU.
1B parameters need about 4 GB of memory in float32 format, so it is really hard to run inference fast on a single CPU. If you want performance on a mobile/laptop CPU, try the InferLLM repo: https://github.com/MegEngine/InferLLM
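For reference, a rough way to estimate the weight memory is parameters times bytes per element; this sketch ignores activations, the KV cache, and runtime overhead, and the parameter counts are just the nominal 1B/7B/13B figures:

```python
# Rough weight-memory estimate: parameter count * bytes per element.
BYTES = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str = "float32") -> float:
    return n_params * BYTES[dtype] / 1024**3

for n, name in [(1e9, "1B"), (7e9, "7B"), (13e9, "13B")]:
    print(f"{name}: {weight_memory_gb(n):.1f} GB in float32, "
          f"{weight_memory_gb(n, 'float16'):.1f} GB in float16")
```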
Hello, I only get maybe one token/second, whereas I get 30 tokens/second with the default PyTorch implementation (running on an H100).