sgemm for IQ4_NL #8049

Closed
wants to merge 22 commits into from

Conversation

@netrunnereve (Collaborator) commented Jun 21, 2024

Since IQ4_NL is basically Q4_0 with an additional lookup table on the weights, we can easily add it to sgemm alongside the existing Q4_0 implementation. Currently prompt processing is around 10% faster with this change, but inference becomes 5% slower.
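
For context, here's a rough scalar sketch (not the sgemm code itself) of why the two formats can share a kernel: Q4_0 turns each 4-bit value into `nibble - 8`, while IQ4_NL sends the nibble through a 16-entry nonlinear table, and everything else about the 32-element block is the same. The layout and table values follow ggml-common.h; a plain float scale stands in for the fp16 `d` field to keep the sketch short.

```c
#include <stdint.h>

#define QK 32

/* IQ4_NL nonlinear grid as listed in ggml-common.h (copied here for illustration). */
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

/* Q4_0: each 4-bit value maps to (nibble - 8), scaled by the block scale d. */
void dequant_q4_0_block(float d, const uint8_t qs[QK / 2], float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)((qs[j] & 0x0F) - 8);
        out[j + QK / 2] = d * (float)((qs[j] >> 4)   - 8);
    }
}

/* IQ4_NL: same block layout, but the nibble indexes a lookup table instead of
 * being offset by 8 -- which is why it can reuse the Q4_0 sgemm path. */
void dequant_iq4_nl_block(float d, const uint8_t qs[QK / 2], float out[QK]) {
    for (int j = 0; j < QK / 2; ++j) {
        out[j]          = d * (float)kvalues_iq4nl[qs[j] & 0x0F];
        out[j + QK / 2] = d * (float)kvalues_iq4nl[qs[j] >> 4];
    }
}
```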

As I only have an Ivy Bridge computer, I'll need someone to benchmark this with AVX2 and check if it's actually faster than master for prompt processing. I think it's faster, but if it isn't I'll make this change AVX-only.

(llama_bench chart removed as the numbers were off, see the comment below for my new results)

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jun 21, 2024
@netrunnereve (Collaborator, Author) commented Jun 21, 2024

After further testing on my desktop (not the inconsistent server VM that I posted my original results with) I'm seeing a clear 5% degradation in inference speed with sgemm on IQ4_NL, while prompt processing speed is improved by around 10%. On the server I have seen up to a 15% prompt processing boost in some cases but the 5% inference slowdown is present as well.

What's happening here is that sgemm overrides the existing ggml_vec_dot kernels with its own code for both prompt processing and inference. The IQ4_NL ggml_vec_dot implementation obviously doesn't have tiling, so it's slower for the prompt processing matrix multiplications, but it computes two blocks per loop iteration, which gives it a small boost during inference (see the sketch below).
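
To make the "two blocks per loop" point concrete, here is a simplified scalar sketch (not the actual AVX kernel): the loop over 32-element blocks is unrolled by two with independent accumulators, which hides some latency in the single-row dot products used during inference but doesn't provide the register tiling that sgemm uses for prompt processing. Assume the quantized values were already expanded to float, purely to keep the sketch short.

```c
#include <stddef.h>

/* Scalar sketch of "two blocks per loop": not the real AVX kernel, just the
 * unrolling idea. nb is the number of 32-element blocks in the row. */
float dot_two_blocks_per_loop(const float *x, const float *y, size_t nb) {
    float sum0 = 0.0f, sum1 = 0.0f;           /* two independent accumulators */
    size_t i = 0;
    for (; i + 1 < nb; i += 2) {              /* handle two blocks per iteration */
        const float *x0 = x + (i + 0) * 32, *y0 = y + (i + 0) * 32;
        const float *x1 = x + (i + 1) * 32, *y1 = y + (i + 1) * 32;
        for (int j = 0; j < 32; ++j) {
            sum0 += x0[j] * y0[j];            /* block i   feeds accumulator 0 */
            sum1 += x1[j] * y1[j];            /* block i+1 feeds accumulator 1 */
        }
    }
    if (i < nb) {                             /* odd trailing block */
        for (int j = 0; j < 32; ++j) sum0 += x[i * 32 + j] * y[i * 32 + j];
    }
    return sum0 + sum1;
}
```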

Desktop results (Xeon E3 v2, 4c/8t)

| model                              | size     | params | backend | threads | test  | t/s         |
| ---------------------------------- | -------- | ------ | ------- | ------- | ----- | ----------- |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 6.12 ± 0.02 |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 4.62 ± 0.02 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 6.74 ± 0.03 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 4.37 ± 0.00 |

Server results (8 core VM on Xeon E5 v2, 8c/16t, unloaded rerun)

| model                              | size     | params | backend | threads | test  | t/s          |
| ---------------------------------- | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 9.23 ± 0.02  |
| llama 8B IQ4_NL - 4.5 bpw (Master) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.96 ± 0.05  |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 10.29 ± 0.01 |
| llama 8B IQ4_NL - 4.5 bpw (PR)     | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.54 ± 0.17  |

I'm not interested in modifying sgemm to do two blocks per loop, and that would also mess with how the tiling is set up. Right now I guess the question is whether a 10-15% improvement in prompt processing is worth a 5% regression in inference speed.

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok but manageable by beginner to medium expertise level) on Jun 21, 2024
@netrunnereve (Collaborator, Author) commented

I'm closing this, as IQ4_XS and Q4_K_S completely trump IQ4_NL performance-wise on CPU even without sgemm, while having the same or better perplexity and KL divergence. IQ4_NL was made for the special case where we can't use the I-quant or K-quant superblocks, and pretty much all modern models don't have that issue.

If anyone's interested feel free to reopen this or improve on my code, but I really don't see the point in this.

| model                          | size     | params | backend | threads | test  | t/s          |
| ------------------------------ | -------- | ------ | ------- | ------- | ----- | ------------ |
| llama 8B IQ4_XS - 4.25 bpw     | 4.13 GiB | 8.03 B | CPU     | 8       | pp512 | 10.82 ± 0.01 |
| llama 8B IQ4_XS - 4.25 bpw     | 4.13 GiB | 8.03 B | CPU     | 8       | tg128 | 7.74 ± 0.08  |
| llama 8B Q4_K - Small          | 4.36 GiB | 8.03 B | CPU     | 8       | pp512 | 11.89 ± 0.17 |
| llama 8B Q4_K - Small          | 4.36 GiB | 8.03 B | CPU     | 8       | tg128 | 7.93 ± 0.03  |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU     | 8       | pp512 | 10.29 ± 0.01 |
| llama 8B IQ4_NL - 4.5 bpw (PR) | 4.35 GiB | 8.03 B | CPU     | 8       | tg128 | 6.54 ± 0.17  |

ggerganov pushed a commit that referenced this pull request Sep 16, 2024
* squashed

readd my iq4_nl sgemm PR #8049

have ggml_vec_dot_q4_0 do two blocks per loop for avx

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. as per #8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as i cant make it faster than before
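
Not part of the commit above, but since it mentions the F16C experiment: the idea there is to convert fp16 block scales to fp32 with the hardware vcvtph2ps instruction rather than one value at a time in software. A minimal sketch of just that conversion step, under the assumption of four fp16 scales gathered into a contiguous buffer (which is not how the block structs lay them out); compile with -mf16c.

```c
#include <immintrin.h>
#include <stdint.h>

/* F16C sketch: convert four fp16 scales (raw 16-bit storage) to four floats.
 * Assumes the scales were gathered into a contiguous buffer; the real ggml
 * code converts scales via its GGML_FP16_TO_FP32 helpers instead. */
static inline __m128 convert_4_fp16_scales(const uint16_t *h) {
    __m128i packed = _mm_loadl_epi64((const __m128i *)h); /* load 4 x 16-bit     */
    return _mm_cvtph_ps(packed);                          /* vcvtph2ps: -> fp32  */
}
```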
ggerganov pushed a commit to ggerganov/ggml that referenced this pull request Sep 20, 2024

ggerganov pushed a commit to ggerganov/whisper.cpp that referenced this pull request Sep 24, 2024

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024

lyapple2008 pushed a commit to lyapple2008/whisper.cpp.mars that referenced this pull request Nov 2, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024

github-actions bot pushed a commit to martin-steinegger/ProstT5-llama that referenced this pull request Dec 30, 2024