sgemm for IQ4_NL #8049
Conversation
After further testing on my desktop (not the inconsistent server VM that I posted my original results with) I'm seeing a clear 5% degradation in inference speed with sgemm on IQ4_NL, while prompt processing speed is improved by around 10%. On the server I have seen up to a 15% prompt processing boost in some cases, but the 5% inference slowdown is present as well. What's happening here is that sgemm overrides the existing implementation.
Desktop results (Xeon E3 v2, 4c/8t)
Server results (8 core VM on Xeon E5 v2, 8c/16t, unloaded rerun)
I'm not interested in modifying sgemm to do two blocks per loop, and that would also mess with how tiling is set up. Right now I guess the question is whether a 10-15% improvement in prompt processing is worth a 5% regression in inference speed.
I'm closing this as IQ4_XS and Q4_K_S completely trump IQ4_NL performance-wise on CPU even without sgemm, while having the same or better perplexity and KL divergence. IQ4_NL was made for the special case where we can't use the I- or K-quant superblocks, and pretty much all modern models don't have this issue. If anyone's interested, feel free to reopen this or improve on my code, but I really don't see the point in this.
* squashed: readd my iq4_nl sgemm PR ggerganov/llama.cpp#8049; have ggml_vec_dot_q4_0 do two blocks per loop for avx; try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. As per ggerganov/llama.cpp#8549 we can calculate several blocks at a time with no issue
* shuffle
* remove f16c iq4_nl as I can't make it faster than before
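For context on the "two blocks per loop" change mentioned in that commit message, here is a minimal scalar sketch of the idea. The structs and function name are simplified stand-ins (plain float scales instead of ggml's fp16 block types), not the real kernel; the actual AVX code does the same thing with SIMD registers instead of scalars.

```c
#include <stdint.h>

#define QK 32  /* elements per block, as in Q4_0/Q8_0 */

typedef struct { float d; uint8_t qs[QK/2]; } blk_q4;  /* simplified q4_0-style block */
typedef struct { float d; int8_t  qs[QK];   } blk_q8;  /* simplified q8_0-style block */

/* Dot product processed two blocks per loop iteration.  Two independent
 * accumulators give wide SIMD units more parallel work per iteration;
 * this sketch assumes nblocks is even. */
static float vec_dot_two_blocks(int nblocks, const blk_q4 *x, const blk_q8 *y) {
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int i = 0; i < nblocks; i += 2) {
        int32_t s0 = 0, s1 = 0;
        for (int j = 0; j < QK/2; ++j) {
            s0 += ((x[i+0].qs[j] & 0x0F) - 8) * y[i+0].qs[j]
                + ((x[i+0].qs[j] >>   4) - 8) * y[i+0].qs[j + QK/2];
            s1 += ((x[i+1].qs[j] & 0x0F) - 8) * y[i+1].qs[j]
                + ((x[i+1].qs[j] >>   4) - 8) * y[i+1].qs[j + QK/2];
        }
        acc0 += x[i+0].d * y[i+0].d * (float) s0;
        acc1 += x[i+1].d * y[i+1].d * (float) s1;
    }
    return acc0 + acc1;
}
```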
Since IQ4_NL is basically Q4_0 with an additional look-up table on the weights, we can easily add it to sgemm alongside the existing Q4_0 implementation. Currently prompt processing is around 10% faster with this change, but inference becomes 5% slower.
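To make the "Q4_0 plus a look-up table" relationship concrete, here is a minimal scalar dequantization sketch, assuming simplified stand-in structs (a plain float scale instead of ggml_half). The codebook below is the non-linear table IQ4_NL uses (as I recall it from ggml-common.h, so treat the exact values as illustrative); the only difference from Q4_0 is that the 4-bit index goes through that table instead of the linear (q - 8) map.

```c
#include <stdint.h>

#define QK 32  /* elements per block for both Q4_0 and IQ4_NL */

typedef struct { float d; uint8_t qs[QK/2]; } blk_q4_0;   /* simplified block_q4_0 */
typedef struct { float d; uint8_t qs[QK/2]; } blk_iq4_nl; /* simplified block_iq4_nl */

/* Non-linear codebook used by IQ4_NL (kvalues_iq4nl in ggml). */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

/* Q4_0: weight = d * (q - 8), a linear map of the 4-bit index. */
static void dequant_q4_0(const blk_q4_0 *b, float *y) {
    for (int j = 0; j < QK/2; ++j) {
        y[j]        = b->d * (float) ((b->qs[j] & 0x0F) - 8);
        y[j + QK/2] = b->d * (float) ((b->qs[j] >>   4) - 8);
    }
}

/* IQ4_NL: identical block layout, but the index is looked up in the codebook,
 * which is why it can share the Q4_0 sgemm path with one extra shuffle. */
static void dequant_iq4_nl(const blk_iq4_nl *b, float *y) {
    for (int j = 0; j < QK/2; ++j) {
        y[j]        = b->d * (float) kvalues_iq4nl[b->qs[j] & 0x0F];
        y[j + QK/2] = b->d * (float) kvalues_iq4nl[b->qs[j] >>   4];
    }
}
```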
As I only have an Ivy Bridge computer, I'll need someone to benchmark this with AVX2 and check whether it's actually faster than master for prompt processing. I think it's faster, but if it isn't I'll make this change AVX-only.
(llama_bench chart removed as the numbers were off; see the comment below for my new results)