Performance on AMD (gfx1030)? #562
jeroen-mostert
asked this question in Q&A
I tried ExLlamaV2 to see how well it works on AMD, since I'd seen discussions here and there suggesting it might be faster than llama.cpp in some cases, given that llama.cpp is not particularly tuned for AMD.
I tried a simple test with Phi3, no quantization. For llama.cpp I converted it to a .gguf and ran `llama-bench` on it, resulting in a speed of ~42 tokens/sec on my RX 6800 (aka gfx1030).
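(For reference, the tokens/sec figures here are just generated tokens divided by wall-clock generation time. A minimal sketch of how I measure that consistently across backends; `generate_fn` is only a placeholder for whatever backend is being tested, not a real API:)

```python
import time

def tokens_per_second(generate_fn, prompt: str, max_new_tokens: int = 128) -> float:
    """Time one generation call and return generated tokens per second.

    `generate_fn` is a stand-in for the backend under test; it is assumed
    to return the number of tokens it actually generated.
    """
    start = time.perf_counter()
    n_generated = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_generated / elapsed
```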
For exllamav2 I cloned the repo and did a `pip install -U . -vvv`. The compile commands look OK and are enabling the right offloads. However, a test with `python examples/inference.py` gives a token generation speed of only 18 tokens/sec, which is bad in both an absolute and a relative sense. `torch.cuda.is_available()` is `True` and `torch.cuda.get_arch_list()` gives `['gfx1030', 'gfx1036', 'gfx1102']`, which is indeed what I've built ROCm for (this is ROCm 6.1.2 built with https://github.com/lamikr/rocm_sdk_builder). For fun I also tried installing the global version with a plain `pip install`, but that has the same effect, since it just compiles the same thing on the fly.

Before I do a deep dive into this, I'd like a sanity check. Am I doing something wrong? Comparing the wrong things? Is this performance expected? Do other people get better speeds? If I do dive into this, would it even be an ExLlamaV2 issue, or am I more likely to find the issue in PyTorch itself?
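In case it helps anyone reproduce this, here is roughly the sanity check I ran, padded out with a couple of extra diagnostics (`torch.version.hip` and the device name) that aren't strictly required and aren't exllamav2-specific:

```python
import torch

# What PyTorch was built against and what it can see at runtime.
print("torch version:      ", torch.__version__)
print("HIP/ROCm version:   ", torch.version.hip)           # None on a CUDA or CPU-only build
print("cuda.is_available():", torch.cuda.is_available())
print("arch list:          ", torch.cuda.get_arch_list())  # e.g. ['gfx1030', 'gfx1036', 'gfx1102']
if torch.cuda.is_available():
    print("device name:        ", torch.cuda.get_device_name(0))
```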