sycl: Add reorder to Q6_K mmvq implementation #13885

Open · wants to merge 1 commit into master from mmvq_q6_k_reorder

Conversation

@s-Nick (Collaborator) commented on May 29, 2025:

This PR implements quant reordering in mmvq for the Q6_K quantization, following the work done for Q4_0 and Q4_K.
These changes give good results on BMG and do not degrade performance on other GPUs.
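
For reference, the reordered layout stores each component of every block contiguously: all low quant bits first, then all high bits, then all sub-block scales, then the per-block super-block scales. Below is a minimal sketch of how such a buffer can be addressed; the view struct and helper names are illustrative (not the PR's actual code), while the per-component sizes follow ggml's block_q6_K with QK_K = 256.

```cpp
#include <cstddef>
#include <cstdint>

using ggml_half = uint16_t;  // fp16 storage type, as in ggml
constexpr size_t QK_K = 256; // super-block size, as in ggml

// View over a reordered Q6_K buffer: each component of every block is
// stored contiguously, so a subgroup loads coalesced spans instead of
// striding over interleaved block_q6_K structs.
struct q6_k_reordered_view {
    const uint8_t *   ql;     // low 4 bits of the quants: QK_K/2 bytes per block
    const uint8_t *   qh;     // high 2 bits of the quants: QK_K/4 bytes per block
    const int8_t *    scales; // sub-block scales: QK_K/16 bytes per block
    const ggml_half * d;      // one super-block scale per block
};

inline q6_k_reordered_view make_q6_k_view(const void * data, size_t nblocks) {
    const uint8_t * base = static_cast<const uint8_t *>(data);
    q6_k_reordered_view v;
    v.ql = base;
    v.qh = v.ql + (QK_K / 2) * nblocks; // high bits start after all low bits
    // scales come after all quant bits (low + high), so both sizes are added
    v.scales = reinterpret_cast<const int8_t *>(v.qh + (QK_K / 4) * nblocks);
    v.d = reinterpret_cast<const ggml_half *>(
        reinterpret_cast<const uint8_t *>(v.scales) + (QK_K / 16) * nblocks);
    return v;
}
```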

Performance impact

All numbers were taken with `GGML_SYCL_DISABLE_OPT=0`.

Battlemage B580

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7421.88 ± 40.31 | 7303.25 ± 190.33 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 134.93 ± 4.28 | 132.17 ± 7.13 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 7508.23 ± 12.32 | 7543.68 ± 52.94 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 124.74 ± 2.57 | 117.75 ± 2.95 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2164.47 ± 3.86 | 2156.43 ± 4.42 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 65.13 ± 0.55 | 65.65 ± 0.38 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 2206.63 ± 4.49 | 2202.15 ± 5.80 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 55.04 ± 0.19 | 52.57 ± 0.23 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 5722.94 ± 33.50 | 5688.02 ± 29.39 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 93.32 ± 2.52 | 88.34 ± 2.42 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3043.55 ± 8.11 | 3047.39 ± 6.17 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 95.46 ± 2.23 | 95.79 ± 2.39 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 3140.52 ± 6.44 | 3154.81 ± 5.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 72.71 ± 0.89 | 69.35 ± 0.16 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1472.53 ± 4.66 | 1468.04 ± 0.47 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 23.48 ± 0.03 | 20.30 ± 0.05 |

Lunar Lake

| model | size | params | backend | ngl | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | pp512 | 1559.52 ± 38.29 | 1761.94 ± 15.09 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | tg128 | 58.17 ± 0.59 | 56.42 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | pp512 | 1775.38 ± 37.60 | 1675.54 ± 32.04 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | tg128 | 42.21 ± 0.19 | 39.81 ± 0.65 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | pp512 | 391.97 ± 4.96 | 433.80 ± 1.21 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | tg128 | 21.61 ± 0.51 | 20.38 ± 0.61 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | pp512 | 492.83 ± 0.55 | 488.92 ± 1.28 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | tg128 | 15.86 ± 0.24 | 14.98 ± 0.16 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | pp512 | 985.69 ± 62.63 | 990.18 ± 10.61 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | tg128 | 29.24 ± 0.15 | 27.48 ± 0.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | pp512 | 674.42 ± 0.61 | 665.08 ± 3.18 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | tg128 | 34.34 ± 0.09 | 33.52 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | pp512 | 743.15 ± 1.86 | 737.78 ± 2.94 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | tg128 | 22.53 ± 0.48 | 22.17 ± 0.30 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | pp512 | 301.50 ± 14.06 | 302.83 ± 1.09 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | tg128 | 7.72 ± 0.05 | 6.06 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | pp512 | 411.81 ± 4.68 | 418.31 ± 5.50 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | tg128 | 14.46 ± 0.06 | 13.40 ± 0.09 |

Intel Arc A770

| model | size | params | backend | ngl | sm | mmap | test | this PR t/s | master (26b79b6) t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4456.48 ± 8.69 | 4433.16 ± 9.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 45.60 ± 0.21 | 44.82 ± 0.24 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 4499.64 ± 3.28 | 4460.18 ± 5.36 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 44.60 ± 0.16 | 43.98 ± 0.14 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1716.17 ± 1.29 | 1707.70 ± 1.18 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 34.40 ± 0.03 | 34.06 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 1732.65 ± 2.73 | 1723.82 ± 1.31 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 32.60 ± 0.27 | 31.71 ± 0.27 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 3661.03 ± 6.91 | 3632.01 ± 6.45 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 39.20 ± 0.31 | 38.16 ± 0.36 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2464.93 ± 2.58 | 2445.01 ± 2.54 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 39.98 ± 0.01 | 39.41 ± 0.33 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 2513.13 ± 2.47 | 2492.01 ± 1.80 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 34.50 ± 0.30 | 34.14 ± 0.02 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | pp512 | 1031.68 ± 1.04 | 1024.74 ± 1.07 |
| llama 34B Q6_K | 8.20 GiB | 10.73 B | SYCL | 99 | none | 0 | tg128 | 17.33 ± 0.11 | 15.12 ± 0.16 |

@github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on May 29, 2025.
@Alcpz (Collaborator) left a comment:

This looks good. Most of my comments are minor things or topics for discussion.


```cpp
auto * ql_ptr = data_device;
auto * qh_ptr = ql_ptr + (QK_K / 2) * nblocks;
// scales are after all quants' bits so adding both to get correct offset
```
Collaborator:

This comment describes the reordered structure a bit. Down below you have a similar comment for the high and low bits. I suggest having both in the same place if we want to keep these.

Collaborator Author:

Thank you for spotting it. I think a better place for this comment is inside the struct where the offset is computed, so I am going to remove it from here.

```cpp
using q6_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q6_K>;
using q6_k_traits = typename q6_k_block::traits;

// contiguous v/x values
```
Collaborator:

Can you explain this comment?

Collaborator Author:

It was left over from the original vec_dot_q6_K_q8_1_impl_mmvq. It is valid for all K quantizations and simply means that the v values it uses are contiguous. In retrospect it is a bit cryptic and redundant, so I'll remove it.


```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const block_q8_1 * __restrict__ bq8_1,
                 const int & iqs, int /* n_blocks */) {
```
Collaborator:

Since you found a way to get rid of n_blocks, and q6_K is quite similar to q4_K, do you think it's feasible to remove it from q4_K as well, so we reduce the function signature?

It's not really part of this PR, so if it would require much more work we can add a TODO to deal with it.

Collaborator Author:

I think it is possible. I read the q4_K logic and n_blocks is used only to compute the position after all the qs. It looks like we could put both the block scales and the super-block scales in the d_offset pair and compute them only once.

Collaborator:

Is it a small enough change? If not, we can refactor as part of a different PR.
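
A minimal sketch of the idea discussed in this thread, assuming hypothetical per-block byte counts for q4_K (names and values here are illustrative, not taken from the PR): both offsets are computed once at the call site and carried in the existing d_offset pair, so the functor no longer needs n_blocks.

```cpp
#include <utility>

// Hypothetical per-block byte counts for q4_K (illustrative names only).
constexpr int qs_bytes_per_block     = 128; // all quant bits of one block
constexpr int scales_bytes_per_block = 12;  // packed scales/mins of one block

// Compute both offsets once, outside the hot loop.
inline std::pair<int, int> make_d_offset(int nblocks) {
    const int scales_offset = qs_bytes_per_block * nblocks;     // after all qs
    const int d_offset = scales_offset + scales_bytes_per_block * nblocks;
    return { scales_offset, d_offset };                         // {scales, d}
}
```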

```diff
@@ -35,9 +35,8 @@ static void mul_mat_vec_q_reorder(const void * __restrict__ vx, const void * __r
     for (int i = sg.get_local_linear_id() / block_elements_per_subgroup; i < blocks_per_row; i += blocks_per_subgroup) {
         const int ibx = row * blocks_per_row + i; // x block index
         // TODO: Generalize offsets, right now only works for quantizations that don't split high and low bits
```
Collaborator:

This PR deals with this TODO.

Collaborator Author:

I thought so too, but I wasn't sure I covered all the possible cases. If you are, I am happy to remove the comment.

```cpp
const int8_t * scales = reinterpret_cast<const int8_t *>(base + d_offset.first);
const ggml_half * d = (const ggml_half *) (base + d_offset.second) + ib;

const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
```
Collaborator:

Discussion: block traits (traits::qk and such) were introduced so that the QI6_K, QK_K and similar macros aren't lying around. Are we all happy with having the generic traits only in the mmvq entrypoint (mul_mat_vec_q_reorder)?

I used them in Q4_0, but that case has much simpler quantize/dequantize algorithms. Just double-checking that this is a conscious choice.

Collaborator Author:

While computing the offsets I found it more intuitive to use macros. I don't mind changing it as long as the SYCL backend style is consistent (the Q4_K reorder also still uses macros).

Collaborator:

I meant the opposite: to leave them as they are. It seems that the macros are shorter and easier to read, but I wanted to see what others thought about it.
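
For illustration, the offset computation written in both styles discussed above. The macro values are as defined in ggml (QR6_K = 2, QI6_K = QK_K / (4 * QR6_K) = 32); the traits member names (qr, qi) are an assumption modeled on the block_q_t traits pattern, not verified against the source.

```cpp
// ggml's macro values for Q6_K (QK_K = 256).
constexpr int QR6_K = 2;
constexpr int QI6_K = 32;

// Hypothetical traits mirroring the same constants (member names assumed).
struct q6_k_traits_sketch {
    static constexpr int qr = QR6_K;
    static constexpr int qi = QI6_K;
};

// Macro style, as used in the PR:
inline int bq8_offset_macro(int iqs) {
    return 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
}

// Equivalent traits style:
inline int bq8_offset_traits(int iqs) {
    return 2 * q6_k_traits_sketch::qr * (iqs / (q6_k_traits_sketch::qi / 2)) +
           (iqs % (q6_k_traits_sketch::qi / 2)) / (q6_k_traits_sketch::qi / 4);
}
```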

@s-Nick force-pushed the mmvq_q6_k_reorder branch from 3ec8eb3 to 6d0c2d8 on June 3, 2025.