sycl: Add reorder to Q6_K mmvq implementation #13885

Open · wants to merge 1 commit into master from mmvq_q6_k_reorder

Conversation

@s-Nick (Collaborator) commented on May 29, 2025:

This PR implements quant reordering in mmvq for the Q6_K quantization, following the work done for Q4_0 and Q4_K.
These changes give good results on BMG and do not degrade performance on other GPUs.
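
For reference, the reordered layout stores each component of every block contiguously: all low quant bits first, then all high bits, then all sub-block scales, then the per-block super-block scales. Below is a minimal sketch of how such a buffer can be addressed; the view struct and helper names are illustrative (not the PR's actual code), while the per-component sizes follow ggml's block_q6_K with QK_K = 256.

```cpp
#include <cstddef>
#include <cstdint>

using ggml_half = uint16_t;  // fp16 storage type, as in ggml
constexpr size_t QK_K = 256; // super-block size, as in ggml

// View over a reordered Q6_K buffer: each component of every block is
// stored contiguously, so a subgroup loads coalesced spans instead of
// striding over interleaved block_q6_K structs.
struct q6_k_reordered_view {
    const uint8_t *   ql;     // low 4 bits of the quants: QK_K/2 bytes per block
    const uint8_t *   qh;     // high 2 bits of the quants: QK_K/4 bytes per block
    const int8_t *    scales; // sub-block scales: QK_K/16 bytes per block
    const ggml_half * d;      // one super-block scale per block
};

inline q6_k_reordered_view make_q6_k_view(const void * data, size_t nblocks) {
    const uint8_t * base = static_cast<const uint8_t *>(data);
    q6_k_reordered_view v;
    v.ql = base;
    v.qh = v.ql + (QK_K / 2) * nblocks; // high bits start after all low bits
    // scales come after all quant bits (low + high), so both sizes are added
    v.scales = reinterpret_cast<const int8_t *>(v.qh + (QK_K / 4) * nblocks);
    v.d = reinterpret_cast<const ggml_half *>(
        reinterpret_cast<const uint8_t *>(v.scales) + (QK_K / 16) * nblocks);
    return v;
}
```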

Performance impact

All numbers were taken with `GGML_SYCL_DISABLE_OPT=0`.

Battlemage B580

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7421.88 ± 40.31 | 7303.25 ± 190.33 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 134.93 ± 4.28 | 132.17 ± 7.13 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7508.23 ± 12.32 | 7543.68 ± 52.94 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 124.74 ± 2.57 | 117.75 ± 2.95 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2164.47 ± 3.86 | 2156.43 ± 4.42 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 65.13 ± 0.55 | 65.65 ± 0.38 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2206.63 ± 4.49 | 2202.15 ± 5.80 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 55.04 ± 0.19 | 52.57 ± 0.23 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 5722.94 ± 33.50 | 5688.02 ± 29.39 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 93.32 ± 2.52 | 88.34 ± 2.42 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3043.55 ± 8.11 | 3047.39 ± 6.17 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 95.46 ± 2.23 | 95.79 ± 2.39 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3140.52 ± 6.44 | 3154.81 ± 5.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 72.71 ± 0.89 | 69.35 ± 0.16 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1472.53 ± 4.66 | 1468.04 ± 0.47 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 23.48 ± 0.03 | 20.30 ± 0.05 |

Lunar Lake

| model | size | params | backend | ngl | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | pp512 | 1559.52 ± 38.29 | 1761.94 ± 15.09 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | tg128 | 58.17 ± 0.59 | 56.42 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | pp512 | 1775.38 ± 37.60 | 1675.54 ± 32.04 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | tg128 | 42.21 ± 0.19 | 39.81 ± 0.65 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | pp512 | 391.97 ± 4.96 | 433.80 ± 1.21 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | tg128 | 21.61 ± 0.51 | 20.38 ± 0.61 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | pp512 | 492.83 ± 0.55 | 488.92 ± 1.28 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | tg128 | 15.86 ± 0.24 | 14.98 ± 0.16 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | pp512 | 985.69 ± 62.63 | 990.18 ± 10.61 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | tg128 | 29.24 ± 0.15 | 27.48 ± 0.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | pp512 | 674.42 ± 0.61 | 665.08 ± 3.18 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | tg128 | 34.34 ± 0.09 | 33.52 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | pp512 | 743.15 ± 1.86 | 737.78 ± 2.94 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | tg128 | 22.53 ± 0.48 | 22.17 ± 0.30 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | pp512 | 301.50 ± 14.06 | 302.83 ± 1.09 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | tg128 | 7.72 ± 0.05 | 6.06 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | pp512 | 411.81 ± 4.68 | 418.31 ± 5.50 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | tg128 | 14.46 ± 0.06 | 13.40 ± 0.09 |

Intel Arc A770

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4456.48 ± 8.69 | 4433.16 ± 9.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 45.60 ± 0.21 | 44.82 ± 0.24 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4499.64 ± 3.28 | 4460.18 ± 5.36 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 44.60 ± 0.16 | 43.98 ± 0.14 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1716.17 ± 1.29 | 1707.70 ± 1.18 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 34.40 ± 0.03 | 34.06 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1732.65 ± 2.73 | 1723.82 ± 1.31 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 32.60 ± 0.27 | 31.71 ± 0.27 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 3661.03 ± 6.91 | 3632.01 ± 6.45 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 39.20 ± 0.31 | 38.16 ± 0.36 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2464.93 ± 2.58 | 2445.01 ± 2.54 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 39.98 ± 0.01 | 39.41 ± 0.33 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2513.13 ± 2.47 | 2492.01 ± 1.80 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 34.50 ± 0.30 | 34.14 ± 0.02 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1031.68 ± 1.04 | 1024.74 ± 1.07 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 17.33 ± 0.11 | 15.12 ± 0.16 |

@github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on May 29, 2025.
@Alcpz (Collaborator) left a comment:

This looks good. Most of my comments are minor things or topics for discussion.


```cpp
auto * ql_ptr = data_device;
auto * qh_ptr = ql_ptr + (QK_K / 2) * nblocks;
// scales are after all quants' bits so adding both to get correct offset
```
Collaborator:

This comment describes the reordered structure a bit. Down below you have a similar comment for the high and low bits. I suggest having both in the same place if we want to keep these.

Collaborator Author:

Thank you for spotting it. I think a better place for this comment is inside the struct where the offset is computed, so I am going to remove it from here.

```cpp
using q6_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q6_K>;
using q6_k_traits = typename q6_k_block::traits;

// contiguous v/x values
```
Collaborator:

Can you explain this comment?

Collaborator Author:

It was left over from the original vec_dot_q6_K_q8_1_impl_mmvq. It is valid for all K quantizations and simply means that the v values it uses are contiguous. In retrospect it is a bit cryptic and redundant, so I'll remove it.


```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const block_q8_1 * __restrict__ bq8_1,
                 const int & iqs, int /* n_blocks */) {
```
Collaborator:

Since you found a way to get rid of n_blocks, and q6_K is quite similar to q4_K, do you think it's feasible to remove it from q4_K as well, so we reduce the function signature?

It's not really part of this PR, so if it would require much more work we can add a TODO to deal with it.

Collaborator Author:

I think it is possible. I read the q4_K logic and n_blocks is used only to compute the position after all the qs. It looks like we could put both the block scales and the super-block scales in the d_offset pair and compute them only once.

Collaborator:

Is it a small enough change? If not, we can refactor as part of a different PR.
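
A minimal sketch of the idea discussed in this thread, assuming hypothetical per-block byte counts for q4_K (names and values here are illustrative, not taken from the PR): both offsets are computed once at the call site and carried in the existing d_offset pair, so the functor no longer needs n_blocks.

```cpp
#include <utility>

// Hypothetical per-block byte counts for q4_K (illustrative names only).
constexpr int qs_bytes_per_block     = 128; // all quant bits of one block
constexpr int scales_bytes_per_block = 12;  // packed scales/mins of one block

// Compute both offsets once, outside the hot loop.
inline std::pair<int, int> make_d_offset(int nblocks) {
    const int scales_offset = qs_bytes_per_block * nblocks;     // after all qs
    const int d_offset = scales_offset + scales_bytes_per_block * nblocks;
    return { scales_offset, d_offset };                         // {scales, d}
}
```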

```diff
@@ -35,9 +35,8 @@ static void mul_mat_vec_q_reorder(const void * __restrict__ vx, const void * __r
     for (int i = sg.get_local_linear_id() / block_elements_per_subgroup; i < blocks_per_row; i += blocks_per_subgroup) {
         const int ibx = row * blocks_per_row + i; // x block index
         // TODO: Generalize offsets, right now only works for quantizations that don't split high and low bits
```
Collaborator:

This PR deals with this TODO.

Collaborator Author:

I thought so too, but I wasn't sure I covered all the possible cases. If you are, I am happy to remove the comment.

```cpp
const int8_t * scales = reinterpret_cast<const int8_t *>(base + d_offset.first);
const ggml_half * d = (const ggml_half *) (base + d_offset.second) + ib;

const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
```
Collaborator:

Discussion: block traits (traits::qk and such) were introduced so that the QI6_K, QK_K and similar macros aren't lying around. Are we all happy with having the generic traits only in the mmvq entrypoint (mul_mat_vec_q_reorder)?

I used them in Q4_0, but that case has much simpler quantize/dequantize algorithms. Just double-checking that this is a conscious choice.

Collaborator Author:

While computing the offsets I found it more intuitive to use macros. I don't mind changing it as long as the SYCL backend style is consistent (the Q4_K reorder also still uses macros).

Collaborator:

I meant the opposite: to leave them as they are. It seems that the macros are shorter and easier to read, but I wanted to see what others thought about it.
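
For illustration, the offset computation written in both styles discussed above. The macro values are as defined in ggml (QR6_K = 2, QI6_K = QK_K / (4 * QR6_K) = 32); the traits member names (qr, qi) are an assumption modeled on the block_q_t traits pattern, not verified against the source.

```cpp
// ggml's macro values for Q6_K (QK_K = 256).
constexpr int QR6_K = 2;
constexpr int QI6_K = 32;

// Hypothetical traits mirroring the same constants (member names assumed).
struct q6_k_traits_sketch {
    static constexpr int qr = QR6_K;
    static constexpr int qi = QI6_K;
};

// Macro style, as used in the PR:
inline int bq8_offset_macro(int iqs) {
    return 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
}

// Equivalent traits style:
inline int bq8_offset_traits(int iqs) {
    return 2 * q6_k_traits_sketch::qr * (iqs / (q6_k_traits_sketch::qi / 2)) +
           (iqs % (q6_k_traits_sketch::qi / 2)) / (q6_k_traits_sketch::qi / 4);
}
```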

@s-Nick force-pushed the mmvq_q6_k_reorder branch from 3ec8eb3 to 6d0c2d8 on June 3, 2025.