sycl: Add reorder to Q6_K mmvq implementation #13885
base: master
Conversation
This looks good. Most of my comments are minor things or topics for discussion.
```cpp
auto * ql_ptr = data_device;
auto * qh_ptr = ql_ptr + (QK_K / 2) * nblocks;
// scales are after all quants' bits so adding both to get correct offset
```
This comment describes the reordered structure a bit. Down below you have a similar comment for the high and low bits. I suggest having both in the same place if we want to keep these.
Thank you for spotting it. I think a better place for this comment is inside the struct where the offset is computed, so I am going to remove it from here.
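For illustration, a minimal sketch of what keeping the layout description next to the offset computation could look like, assuming ggml's Q6_K component sizes (QK_K/2 low-bit bytes, QK_K/4 high-bit bytes, and QK_K/16 scale bytes per super-block); the struct and member names here are hypothetical, not the PR's actual code:

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t QK_K = 256; // values per super-block in ggml

// Hypothetical view over a reordered Q6_K buffer: all low bits for every
// block come first, then all high bits, then all scales.
struct q6_k_reordered_view {
    const uint8_t * base;
    size_t          nblocks;

    const uint8_t * ql(size_t ib) const { // low 4 bits, QK_K/2 bytes per block
        return base + (QK_K / 2) * ib;
    }
    const uint8_t * qh(size_t ib) const { // high 2 bits, QK_K/4 bytes per block
        return base + (QK_K / 2) * nblocks + (QK_K / 4) * ib;
    }
    const int8_t * scales(size_t ib) const {
        // scales sit after all quant bits, so both regions are added to the offset
        return reinterpret_cast<const int8_t *>(
            base + (QK_K / 2 + QK_K / 4) * nblocks + (QK_K / 16) * ib);
    }
};
```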
```cpp
using q6_k_block = ggml_sycl_reordered::block_q_t<GGML_TYPE_Q6_K>;
using q6_k_traits = typename q6_k_block::traits;

// contiguous v/x values
```
Can you explain this comment?
It was left there from the original vec_dot_q6_K_q8_1_impl_mmvq. It is valid for all K quantizations and it simply means that the v values it uses are contiguous. In retrospect, it is a bit cryptic and redundant, so I'll remove it.
ggml/src/ggml-sycl/vecdotq.hpp (outdated)
```cpp
float operator()(const void * __restrict__ vbq, const std::pair<int, int> ibx_offset,
                 const std::pair<int, int> d_offset, const block_q8_1 * __restrict__ bq8_1, const int & iqs,
                 int /* n_blocks */) {
```
Since you found a way to get rid of n_blocks and q6_K is quite similar to q4_K, do you think it's feasible to also remove it from q4_K so we can reduce the function signature?
It's not really part of this PR, so if it would require much more work we can add a TODO to deal with it.
I think it is possible. I read the logic for q4_k, and n_blocks is used only to compute the position after all qs. To me it looks like we could put both the block scales and the super-block scales in the d_offset pair and compute them only once.
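A minimal sketch of that idea, assuming the reordered Q4_K buffer stores all packed quants first, then the block scales, then the super-block d values (QK_K and K_SCALE_SIZE are ggml's constants; the helper name and the exact layout are assumptions):

```cpp
#include <utility>

constexpr int QK_K         = 256; // values per super-block in ggml
constexpr int K_SCALE_SIZE = 12;  // packed scale/min bytes per Q4_K block in ggml

// Hypothetical helper: the caller computes both scale offsets once and passes
// them through the d_offset pair, so the vec_dot functor no longer needs n_blocks.
std::pair<int, int> make_q4_k_d_offset(int nblocks) {
    const int qs_bytes     = (QK_K / 2) * nblocks;   // all packed 4-bit quants
    const int scales_bytes = K_SCALE_SIZE * nblocks; // all block scales
    return { qs_bytes,                  // .first:  start of the block scales
             qs_bytes + scales_bytes }; // .second: start of the super-block d values
}
```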
Is it a small enough change? If not, we can refactor it as part of a different PR.
```diff
@@ -35,9 +35,8 @@ static void mul_mat_vec_q_reorder(const void * __restrict__ vx, const void * __r
     for (int i = sg.get_local_linear_id() / block_elements_per_subgroup; i < blocks_per_row; i += blocks_per_subgroup) {
         const int ibx = row * blocks_per_row + i; // x block index
         // TODO: Generalize offsets, right now only works for quantizations that don't split high and low bits
```
This PR deals with this TODO.
I thought so too, but I wasn't sure I had covered all the possible cases. If you are sure, I am happy to remove the comment.
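For context, a small sketch of why quantizations that split their bits need two quant offsets, assuming ggml's Q6_K packing (low 4 bits and high 2 bits in separate regions, values biased by 32); the function and shift parameters are illustrative:

```cpp
#include <cstdint>

// Rebuild one 6-bit Q6_K value from its two bit regions. A single base
// offset cannot address both regions, which is what the TODO was about.
int dequant_q6(const uint8_t * ql, const uint8_t * qh,
               int i, int shift_l, int shift_h) {
    const int low  = (ql[i] >> shift_l) & 0x0F; // low 4 bits
    const int high = (qh[i] >> shift_h) & 0x03; // high 2 bits
    return ((high << 4) | low) - 32;            // 6-bit value with the bias removed
}
```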
```cpp
const int8_t * scales = reinterpret_cast<const int8_t *>(base + d_offset.first);
const ggml_half * d = (const ggml_half *) (base + d_offset.second) + ib;

const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
```
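As a quick sanity check of this indexing, assuming ggml's QR6_K = 2 and QI6_K = QK_K / (4 * QR6_K) = 32:

```cpp
#include <cstdio>

int main() {
    const int QR6_K = 2, QI6_K = 32;
    for (int iqs : {0, 8, 16, 24}) {
        const int bq8_offset = 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
        printf("iqs=%2d -> bq8_offset=%d\n", iqs, bq8_offset); // prints 0, 1, 4, 5
    }
}
```

The first term selects which half of the super-block the quant ints come from; the second selects the position within that half.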
Discussion: block traits (traits::qk and such) were introduced so that the QI6_K, QK_K and similar macros aren't lying around. Are we all happy with having the generic traits only in the mmvq entrypoint (mul_mat_vec_q_reorder)?
I used them in Q4_0, but that case has much simpler quant/dequantize algorithms. Just double checking that this is a conscious choice.
While computing the offset I found it more intuitive to use macros. I don't mind changing it as long as the SYCL backend style is consistent (also, the Q4_K reorder still uses macros).
I meant the opposite: to leave them like that. The macros seem shorter and easier to read, but I wanted to see what others thought about it.
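To make the trade-off concrete, a sketch of the two styles side by side; the trait members qr and qi mirror the macros but are assumptions here, not verified against the backend's traits API:

```cpp
// Stand-ins for ggml's QR6_K / QI6_K macros.
constexpr int QR6_K = 2;
constexpr int QI6_K = 32;

// Stand-in for block_q_t<GGML_TYPE_Q6_K>::traits.
struct q6_k_traits_sketch {
    static constexpr int qr = QR6_K;
    static constexpr int qi = QI6_K;
};

// Macro style: shorter, but ties the kernel to Q6_K-specific names.
int bq8_offset_macros(int iqs) {
    return 2 * QR6_K * (iqs / (QI6_K / 2)) + (iqs % (QI6_K / 2)) / (QI6_K / 4);
}

// Traits style: generic over the block type, at the cost of longer expressions.
template <typename traits> int bq8_offset_traits(int iqs) {
    return 2 * traits::qr * (iqs / (traits::qi / 2)) + (iqs % (traits::qi / 2)) / (traits::qi / 4);
}
```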
Signed-off-by: nscipione <[email protected]>
This PR implements quant reordering in mmvq for the Q6_K quantization, following the work done for Q4_0 and Q4_K.
These changes give good results on BMG and do not degrade performance on other GPUs.
Performance impact

All numbers were taken with GGML_SYCL_DISABLE_OPT=0.

- Battlemage B580
- Lunar Lake
- Intel Arc A770