Performance on AMD (gfx1030)? #562
jeroen-mostert
asked this question in Q&A
I tried ExLlamaV2 to see how well it works on AMD, since I'd seen discussions here and there suggesting it might be faster than llama.cpp in some cases, given that llama.cpp is not particularly tuned for AMD.
I tried a simple test with Phi3, no quantization. For llama.cpp I converted it to a .gguf and ran `llama-bench` on it, resulting in a speed of ~42 tokens/sec on my RX 6800 (aka gfx1030).
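(For reference, the tokens/sec figures here are just generated tokens divided by wall-clock generation time. A minimal sketch of how I measure that consistently across backends; `generate_fn` is only a placeholder for whatever backend is being tested, not a real API:)

```python
import time

def tokens_per_second(generate_fn, prompt: str, max_new_tokens: int = 128) -> float:
    """Time one generation call and return generated tokens per second.

    `generate_fn` is a stand-in for the backend under test; it is assumed
    to return the number of tokens it actually generated.
    """
    start = time.perf_counter()
    n_generated = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return n_generated / elapsed
```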
For exllamav2 I cloned the repo and did a `pip install -U . -vvv`. The compile commands look OK and are enabling the right offloads. However, a test with `python examples/inference.py` gives a token generation speed of only 18 tokens/sec, which is bad in both an absolute and a relative sense. `torch.cuda.is_available()` is `True` and `torch.cuda.get_arch_list()` gives `['gfx1030', 'gfx1036', 'gfx1102']`, which is indeed what I've built ROCm for (this is ROCm 6.1.2 built with https://github.com/lamikr/rocm_sdk_builder). For fun I also tried installing the global version with a plain `pip install`, but that has the same effect, since it just compiles the same thing on the fly.

Before I do a deep dive into this, I'd like a sanity check. Am I doing something wrong? Comparing the wrong things? Is this performance expected? Do other people get better speeds? If I do dive into this, would it even be an ExLlamaV2 issue, or am I more likely to find the issue in PyTorch itself?
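In case it helps anyone reproduce this, here is roughly the sanity check I ran, padded out with a couple of extra diagnostics (`torch.version.hip` and the device name) that aren't strictly required and aren't exllamav2-specific:

```python
import torch

# What PyTorch was built against and what it can see at runtime.
print("torch version:      ", torch.__version__)
print("HIP/ROCm version:   ", torch.version.hip)           # None on a CUDA or CPU-only build
print("cuda.is_available():", torch.cuda.is_available())
print("arch list:          ", torch.cuda.get_arch_list())  # e.g. ['gfx1030', 'gfx1036', 'gfx1102']
if torch.cuda.is_available():
    print("device name:        ", torch.cuda.get_device_name(0))
```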