vulkan: multi-row k quants #10846

Open
netrunnereve wants to merge 6 commits into master

Conversation

netrunnereve (Collaborator)

This allows our k-quant mat-vec shaders to process multiple rows at a time, just like mul_mat_vec.comp. It's way faster now, and Q4_K_S is catching up to IQ4_NL and Q4_0 on my RX 470.

At this point we might want to consider merging the separate k-quant files into mul_mat_vec.comp, as they reuse quite a bit of code, and maybe do some templating using ifdefs to choose the correct dequantization function. That's better left to another PR though.
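Roughly, the idea is to amortize one pass over the input vector across several output rows. The real change lives in the GLSL *.comp shaders; the snippet below is only an illustrative C++ model, and NUM_ROWS and the function name are placeholders, not code from this PR:

```cpp
// Illustrative model of multi-row mat-vec: each workgroup accumulates NUM_ROWS
// dot products instead of one, reusing every loaded input value for all rows.
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_ROWS = 2; // rows handled per workgroup, tunable per GPU

std::array<float, NUM_ROWS> mat_vec_rows(const std::vector<std::vector<float>> & rows_dequant,
                                         const std::vector<float> & x,
                                         std::size_t first_row) {
    std::array<float, NUM_ROWS> acc{};               // one partial result per row
    for (std::size_t k = 0; k < x.size(); ++k) {     // single pass over the input vector
        const float xk = x[k];
        for (std::size_t r = 0; r < NUM_ROWS; ++r) { // reuse xk for every row in the batch
            acc[r] += rows_dequant[first_row + r][k] * xk;
        }
    }
    return acc;
}
```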

PR:

| model | size | params | backend | ngl | threads | main_gpu | sm | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -------- | -- | ---- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 21.88 ± 0.00 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 18.89 ± 0.04 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 27.12 ± 0.12 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | 1 | none | tg128 | 22.55 ± 0.00 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | 1 | none | tg128 | 20.39 ± 0.00 |
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   242.40 us/run - 117.44 MFLOP/run - 484.49 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   447.88 us/run - 117.44 MFLOP/run - 262.22 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   4260 runs -   238.10 us/run - 117.44 MFLOP/run - 493.23 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   315.13 us/run - 117.44 MFLOP/run - 372.68 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   368.04 us/run - 117.44 MFLOP/run - 319.10 GFLOPS

Master:

| model | size | params | backend | ngl | threads | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | ---- | --- |
| llama 8B Q2_K - Medium | 2.95 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 17.66 ± 0.07 |
| llama 8B Q3_K - Medium | 3.74 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 15.74 ± 0.02 |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 20.58 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | Vulkan | 100 | 8 | tg128 | 16.04 ± 0.01 |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 17.57 ± 0.06 |
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   391.03 us/run - 117.44 MFLOP/run - 300.33 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   528.63 us/run - 117.44 MFLOP/run - 222.16 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   374.13 us/run - 117.44 MFLOP/run - 313.90 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   472.77 us/run - 117.44 MFLOP/run - 248.41 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   461.75 us/run - 117.44 MFLOP/run - 254.34 GFLOPS

The number of rows used was chosen for my card and may need tuning for different architectures.

netrunnereve requested a review from 0cc4m on December 16, 2024 at 03:59.
The github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 16, 2024.
0cc4m (Collaborator) commented Dec 16, 2024

Please rebase to reduce the number of commits.

jeffbolznv (Collaborator)

Using multiple rows is a bit slower on RTX 4070, so please change to one row for NVIDIA:

before:

| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |  1 |         tg128 |        114.99 ± 2.51 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |  1 |         tg128 |        118.99 ± 0.70 |

after:

| llama 3B Q5_K - Medium         |   2.16 GiB |     3.21 B | Vulkan     | 1000 |  1 |         tg128 |        114.69 ± 1.86 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |  1 |         tg128 |        115.25 ± 1.42 |

I read through the shader changes and they look good to me.

0cc4m (Collaborator) commented Dec 17, 2024

Intel is being weird again...
Master:

MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    98.25 us/run - 117.44 MFLOP/run -   1.20 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   425.65 us/run - 117.44 MFLOP/run - 275.91 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   8520 runs -   124.92 us/run - 117.44 MFLOP/run - 940.15 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   152.10 us/run - 117.44 MFLOP/run - 772.15 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   545.88 us/run - 117.44 MFLOP/run - 215.14 GFLOPS

PR:

MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   140.19 us/run - 117.44 MFLOP/run - 837.73 GFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   361.53 us/run - 117.44 MFLOP/run - 324.84 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   141.64 us/run - 117.44 MFLOP/run - 829.16 GFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   6816 runs -   160.36 us/run - 117.44 MFLOP/run - 732.34 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   376.82 us/run - 117.44 MFLOP/run - 311.66 GFLOPS

With 1*rm instead of 2*rm (equivalent to rm=1, which was not good for the legacy quants):

MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  11928 runs -    84.55 us/run - 117.44 MFLOP/run -   1.39 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   3408 runs -   322.00 us/run - 117.44 MFLOP/run - 364.73 GFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    99.47 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   7668 runs -   138.77 us/run - 117.44 MFLOP/run - 846.27 GFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                   2556 runs -   565.67 us/run - 117.44 MFLOP/run - 207.61 GFLOPS

It seems to prefer fewer rows for q2_K through q5_K and more rows for q6_K (but performance is bad there either way). I tested this with Q4_K_S and Q6_K models and it confirms the findings.

0cc4m (Collaborator) commented Dec 17, 2024

I can also confirm that 1*rm (fewer rows) is better on Nvidia RTX 3090.

The PR looks good, it just needs some changes to the selection logic. It's probably not worth complicating for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel. The merge conflict needs to be fixed, too.

Edit: Also looks good on AMD RX 6800 XT.

netrunnereve (Collaborator, Author)

> It's probably not worth complicating for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel.

Considering there's a 50% difference in Q6_K performance for Intel I've added a separate variable for it, along with Q8_0 which is also a special case. If there are other quants that don't work well with certain GPUs we can also add them to the list.

BTW have you checked the assembly dump for Intel? I have a feeling that it doesn't like certain memory access patterns and splits those up into a bunch of small loads. Maybe you could try loading each superblock into shared memory first before doing the actual dequantizing.

netrunnereve (Collaborator, Author)

> Edit: Also looks good on AMD RX 6800 XT.

Does that mean it works best with two rows per shader?

0cc4m (Collaborator) commented Dec 19, 2024

> > It's probably not worth complicating for Intel Q6_K, so let's just stick to fewer rows for k-quants on Nvidia and Intel.
>
> Considering there's a 50% difference in Q6_K performance for Intel I've added a separate variable for it, along with Q8_0 which is also a special case. If there are other quants that don't work well with certain GPUs we can also add them to the list.

It's a big difference, but performance is marginal either way. I would prefer not to make it more complex because it increases the number of parameters we need to hand-tune. Maybe it's time for an optimizer.

> BTW have you checked the assembly dump for Intel? I have a feeling that it doesn't like certain memory access patterns and splits those up into a bunch of small loads. Maybe you could try loading each superblock into shared memory first before doing the actual dequantizing.

No, I don't have that much time to devote to Intel.

> > Edit: Also looks good on AMD RX 6800 XT.
>
> Does that mean it works best with two rows per shader?

I meant the PR got optimal performance on it already.

netrunnereve (Collaborator, Author)

This should be it I think:

Default: 2 rows for old quants, 1 row for Q8_0 and K quants
AMD GCN: 4 rows for old quants, 2 rows for Q8_0, 4 rows for K quants
AMD RDNA: 2 rows for old quants, 1 row for Q8_0, 2 rows for K quants
Intel: 4 rows for old quants, 2 rows for Q8_0, 2 rows for K quants
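For illustration only, a rough C++ sketch of what that per-architecture selection could look like on the host side; the enum and field names are assumptions (only rm_kq appears later in this discussion), not the PR's actual code:

```cpp
// Hypothetical sketch of per-architecture mat-vec row counts mirroring the list above.
#include <cstdint>

enum class gpu_arch { amd_gcn, amd_rdna, intel, other };

struct mmv_row_counts {
    uint32_t rm_stdq; // legacy ("old") quants
    uint32_t rm_q8_0; // Q8_0 special case
    uint32_t rm_kq;   // k-quants
};

static mmv_row_counts select_row_counts(gpu_arch arch) {
    switch (arch) {
        case gpu_arch::amd_gcn:  return {4, 2, 4};
        case gpu_arch::amd_rdna: return {2, 1, 2};
        case gpu_arch::intel:    return {4, 2, 2};
        default:                 return {2, 1, 1}; // the listed default
    }
}
```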

0cc4m (Collaborator) commented Dec 21, 2024

I see a significant drop in performance on Nvidia RTX 3090 in tg for a Q4_K_S 8B model:

Master:
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         88.64 ± 0.17 |
PR:
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         77.01 ± 0.43 |
With rm_kq = 2:
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     |  99 |         tg128 |         99.12 ± 2.98 |

This doesn't match with the results from @jeffbolznv. Any theories?

Edit: Some more data:

Master:
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  16188 runs -    62.08 us/run - 117.44 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    98.64 us/run - 117.44 MFLOP/run -   1.19 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  15336 runs -    65.49 us/run - 117.44 MFLOP/run -   1.79 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  14484 runs -    73.18 us/run - 117.44 MFLOP/run -   1.60 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  13632 runs -    76.95 us/run - 117.44 MFLOP/run -   1.53 TFLOPS

PR:
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  16188 runs -    62.29 us/run - 117.44 MFLOP/run -   1.89 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    99.56 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  15336 runs -    65.86 us/run - 117.44 MFLOP/run -   1.78 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  13632 runs -    76.61 us/run - 117.44 MFLOP/run -   1.53 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  13632 runs -    77.54 us/run - 117.44 MFLOP/run -   1.51 TFLOPS

With rm_kq = 2:
  MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  17892 runs -    56.23 us/run - 117.44 MFLOP/run -   2.09 TFLOPS
  MUL_MAT(type_a=q3_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  10224 runs -    99.74 us/run - 117.44 MFLOP/run -   1.18 TFLOPS
  MUL_MAT(type_a=q4_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  17892 runs -    57.15 us/run - 117.44 MFLOP/run -   2.05 TFLOPS
  MUL_MAT(type_a=q5_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  16188 runs -    64.40 us/run - 117.44 MFLOP/run -   1.82 TFLOPS
  MUL_MAT(type_a=q6_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):                  13632 runs -    73.95 us/run - 117.44 MFLOP/run -   1.59 TFLOPS

jeffbolznv (Collaborator)

I just reran and am seeing a small improvement from rm_kq=2:

master:
Meta-Llama-3-8B-Instruct-Q4_K_S.gguf
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         74.62 ± 0.72 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         74.81 ± 0.40 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         74.94 ± 0.40 |
Phi-3-mini-4k-instruct-q4.gguf
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        112.03 ± 0.55 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        112.13 ± 0.13 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        111.80 ± 1.26 |

PR:
Meta-Llama-3-8B-Instruct-Q4_K_S.gguf
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         75.64 ± 0.54 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         75.94 ± 0.30 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         76.02 ± 0.48 |
Phi-3-mini-4k-instruct-q4.gguf
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.20 ± 1.25 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        109.23 ± 1.18 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        110.12 ± 0.78 |

rm_kq=2:
Meta-Llama-3-8B-Instruct-Q4_K_S.gguf
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         75.09 ± 2.04 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         76.39 ± 0.56 |
| llama 8B Q4_K - Small          |   4.36 GiB |     8.03 B | Vulkan     | 1000 |         tg128 |         74.84 ± 0.71 |
Phi-3-mini-4k-instruct-q4.gguf
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        113.10 ± 0.49 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        113.89 ± 0.23 |
| phi3 3B Q4_K - Medium          |   2.23 GiB |     3.82 B | Vulkan     | 1000 |         tg128 |        113.77 ± 0.19 |

(Absolute numbers are a bit different from last time because I didn't use -fa 1 this time.) So I'm OK with rm_kq=2, I guess. Not sure what has changed; maybe I'm just having better luck now with how the compiler schedules things?

netrunnereve (Collaborator, Author)

> This doesn't match with the results from @jeffbolznv. Any theories?

> I just reran and am seeing a small improvement from rm_kq=2:

Was there a driver update recently? That's the only thing I can come up with, considering both of you mentioned earlier that one row was faster on Nvidia. Anyway, I've updated the code to use rm_kq=2 by default.

netrunnereve (Collaborator, Author)

Just noticed this but the tests look strange:

> MUL_MAT(type_a=q2_K,type_b=f32,m=4096,n=1,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3]):

If m, n, and k are what I think they are (m*n A matrix and n*k B matrix) then this isn't a standard matrix vector multiplication anymore but rather the product of a column vector and a row vector that will spit out a 4096*14336 matrix as the output. Since our mul_mat_vec functions don't handle this case I think I'm reading the printout wrong 🤷‍♀️.

The reason I'm asking is that it should be possible to calculate a reasonably optimal row count for your GPU depending on the matrix size. For example, on my RX 470:

64 wide SIMD * 32 cores = 2048 threads minimum to fill up the GPU (realistically you need way more so it can switch to a different subgroup if one gets delayed by memory or something)
64 wide SIMD * 40 subgroups to choose from * 32 cores = 81920 maximum threads at once

If our A matrix has 4096 rows and we're multiplying it against a B vector of size 4096 we'll only generate 4096 threads if rm_kq=1. For something like this we'll probably use split k to generate more threads, though I'm not sure how that algorithm works. It might be possible to have a target thread count depending on the architecture and core count instead of doing all this experimenting or having an autotuner.
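Purely as an illustrative restatement of that arithmetic (the SIMD width, CU count, and wavefront limit are the RX 470 figures quoted above, nothing more):

```cpp
// Back-of-the-envelope occupancy figures for the RX 470 example above.
#include <cstdio>

int main() {
    const unsigned simd_width   = 64; // wavefront width
    const unsigned num_cus      = 32; // compute units
    const unsigned waves_per_cu = 40; // resident wavefronts per CU

    const unsigned min_fill     = simd_width * num_cus;                 // 2048 threads
    const unsigned max_resident = simd_width * waves_per_cu * num_cus;  // 81920 threads

    std::printf("minimum to occupy every SIMD: %u threads\n", min_fill);
    std::printf("maximum resident at once:     %u threads\n", max_resident);
    return 0;
}
```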

jeffbolznv (Collaborator)

The multiply is MxK * KxN -> MxN. These shaders assign one workgroup to each result row (really each result element, because N==1), and that workgroup computes a dot product with K components where each invocation in the workgroup does a subset of the dot product and then they all add up the partial sums at the end. So it's 4096 workgroups in this test, which should be enough to fill the machine.
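A single-threaded C++ model of that scheme might look like the sketch below; the workgroup size and names are assumptions, and the real shader does the final reduction with shared memory or subgroup operations rather than a serial sum:

```cpp
// Toy model of one workgroup: each "invocation" accumulates a strided slice of
// the K-long dot product for its row, then the partial sums are reduced.
#include <cstddef>
#include <numeric>
#include <vector>

float workgroup_dot(const std::vector<float> & a_row, // dequantized row of A, length K
                    const std::vector<float> & x,     // input vector, length K
                    std::size_t workgroup_size) {     // e.g. 32 invocations
    std::vector<float> partial(workgroup_size, 0.0f);
    for (std::size_t tid = 0; tid < workgroup_size; ++tid) {
        for (std::size_t k = tid; k < x.size(); k += workgroup_size) {
            partial[tid] += a_row[k] * x[k];
        }
    }
    // the shader reduces these partial sums across the workgroup at the end
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```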
