sgemm for IQ4_NL #8049

Closed
wants to merge 22 commits into from

Conversation

@netrunnereve (Collaborator) commented Jun 21, 2024

Since IQ4_NL is basically Q4_0 with an additional lookup table on the weights, we can easily add it to sgemm alongside the existing Q4_0 implementation. Currently prompt processing is around 10% faster with this change, but inference becomes 5% slower.
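
For context, here's a rough scalar sketch (not the sgemm code itself) of why the two formats can share a kernel: Q4_0 turns each 4-bit value into `nibble - 8`, while IQ4_NL sends the nibble through a 16-entry nonlinear table, and everything else about the 32-element block is the same. The layout and table values follow ggml-common.h; a plain float scale stands in for the fp16 `d` field to keep the sketch short.

```c
#include <stdint.h>

#define QK 32

/* IQ4_NL nonlinear grid as listed in ggml-common.h (copied here for illustration). */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

/* Q4_0: each 4-bit value maps to (nibble - 8), scaled by the block scale d. */
void dequant_q4_0_block(float d, const uint8_t qs[QK / 2], float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)((qs[j] & 0x0F) - 8);
        out[j + QK / 2] = d * (float)((qs[j] >> 4)   - 8);
    }
}

/* IQ4_NL: same block layout, but the nibble indexes a lookup table instead of
 * being offset by 8 -- which is why it can reuse the Q4_0 sgemm path. */
void dequant_iq4_nl_block(float d, const uint8_t qs[QK / 2], float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)kvalues_iq4nl[qs[j] & 0x0F];
        out[j + QK / 2] = d * (float)kvalues_iq4nl[qs[j] >> 4];
    }
}
```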

As I only have an Ivy Bridge computer, I'll need someone to benchmark this with AVX2 and check if it's actually faster than master for prompt processing. I think it's faster, but if it isn't I'll make this change AVX-only.

(llama_bench chart removed as the numbers were off, see the comment below for my new results)

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 21, 2024
@netrunnereve (Collaborator, Author) commented Jun 21, 2024

After further testing on my desktop (not the inconsistent server VM that I posted my original results with) I'm seeing a clear 5% degradation in inference speed with sgemm on IQ4_NL, while prompt processing speed is improved by around 10%. On the server I have seen up to a 15% prompt processing boost in some cases but the 5% inference slowdown is present as well.

What's happening here is that sgemm overrides the existing ggml_vec_dot kernels with its own code for both prompt processing and inference. The IQ4_NL ggml_vec_dot implementation obviously doesn't have tiling, so it's slower for the prompt processing matrix multiplications, but it computes two blocks per loop iteration, which gives it a small boost during inference (see the sketch below).
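
To make the "two blocks per loop" point concrete, here is a simplified scalar sketch (not the actual AVX kernel): the loop over 32-element blocks is unrolled by two with independent accumulators, which hides some latency in the single-row dot products used during inference but doesn't provide the register tiling that sgemm uses for prompt processing. Assume the quantized values were already expanded to float, purely to keep the sketch short.

```c
#include <stddef.h>

/* Scalar sketch of "two blocks per loop": not the real AVX kernel, just the
 * unrolling idea. nb is the number of 32-element blocks in the row. */
float dot_two_blocks_per_loop(const float *x, const float *y, size_t nb) {
    float sum0 = 0.0f, sum1 = 0.0f;           /* two independent accumulators */
    size_t i = 0;
    for (; i + 1 < nb; i += 2) {              /* handle two blocks per iteration */
        const float *x0 = x + (i + 0) * 32, *y0 = y + (i + 0) * 32;
        const float *x1 = x + (i + 1) * 32, *y1 = y + (i + 1) * 32;
        for (int j = 0; j < 32; ++j) {
            sum0 += x0[j] * y0[j];            /* block i   feeds accumulator 0 */
            sum1 += x1[j] * y1[j];            /* block i+1 feeds accumulator 1 */
        }
    }
    if (i < nb) {                             /* odd trailing block */
        for (int j = 0; j < 32; ++j) sum0 += x[i * 32 + j] * y[i * 32 + j];
    }
    return sum0 + sum1;
}
```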

Desktop results (Xeon E3 v2, 4c/8t)

| model                              | size     | params | backend | threads | test  | t/s         |
| ---------------------------------- | -------- | ------ | ------- | ------- | ----- | ----------- |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 6.12 ± 0.02 |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 4.62 ± 0.02 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 6.74 ± 0.03 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 4.37 ± 0.00 |

Server results (8 core VM on Xeon E5 v2, 8c/16t, unloaded rerun)

| model                              | size     | params | backend | threads | test  | t/s          |
| ---------------------------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 9.23 ± 0.02  |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.96 ± 0.05  |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 10.29 ± 0.01 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.54 ± 0.17  |

I'm not interested in modifying sgemm to do two blocks per loop, and that would also mess with how the tiling is set up. Right now I guess the question is whether a 10-15% improvement in prompt processing is worth a 5% regression in inference speed.

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jun 21, 2024
@netrunnereve (Collaborator, Author) commented

I'm closing this, as IQ4_XS and Q4_K_S completely trump IQ4_NL performance-wise on CPU even without sgemm, while having the same or better perplexity and KL divergence. IQ4_NL was made for the special case where we can't use the I-quant or K-quant superblocks, and pretty much all modern models don't have that issue.

If anyone's interested feel free to reopen this or improve on my code, but I really don't see the point in this.

| model                          | size     | params | backend | threads | test  | t/s          |
| ------------------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B IQ4_XS - 4.25 bpw     | 4.13 GiB | 8.03 B | CPU     | 8       | pp512 | 10.82 ± 0.01 |
| llama 8B IQ4_XS - 4.25 bpw     | 4.13 GiB | 8.03 B | CPU     | 8       | tg128 | 7.74 ± 0.08  |
| llama 8B Q4_K - Small          | 4.36 GiB | 8.03 B | CPU     | 8       | pp512 | 11.89 ± 0.17 |
| llama 8B Q4_K - Small          | 4.36 GiB | 8.03 B | CPU     | 8       | tg128 | 7.93 ± 0.03  |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 10.29 ± 0.01 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.54 ± 0.17  |

ggerganov pushed a commit that referenced this pull request Sep 16, 2024
* squashed

readd my iq4_nl sgemm PR #8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per #8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
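
Not part of the commit above, but since it mentions the F16C experiment: the idea there is to convert fp16 block scales to fp32 with the hardware vcvtph2ps instruction rather than one value at a time in software. A minimal sketch of just that conversion step, under the assumption of four fp16 scales gathered into a contiguous buffer (which is not how the block structs lay them out); compile with -mf16c.

```c
#include <immintrin.h>
#include <stdint.h>

/* F16C sketch: convert four fp16 scales (raw 16-bit storage) to four floats.
 * Assumes the scales were gathered into a contiguous buffer; the real ggml
 * code converts scales via its GGML_FP16_TO_FP32 helpers instead. */
static inline __m128 convert_4_fp16_scales(const uint16_t *h) {
    __m128i packed = _mm_loadl_epi64((const __m128i *)h); /* load 4 x 16-bit     */
    return _mm_cvtph_ps(packed);                          /* vcvtph2ps: -> fp32  */
}
```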
ggerganov pushed a commit to ggerganov/ggml that referenced this pull request Sep 20, 2024

ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Sep 24, 2024

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request Nov 2, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

github-actions bot pushed a commit to martin-steinegger/ProstT5-llama that referenced this pull request Dec 30, 2024