This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator #174

Open
aciddelgado opened this issue Mar 13, 2024 · 13 comments
Assignees
Labels
enhancement New feature or request

Comments

@aciddelgado

I’ve found a performance gap between the Neural Speed Matmul operator in the Neural-Speed repository and the Llama.cpp operator. I identified it while running a benchmark with the ONNXRuntime-GenAI tool.
I used ONNXRuntime-GenAI to run an int4 version of Phi-2 on CPU, which uses the MatMulNBits operator, and compared its performance with the Llama.cpp operator.
The GenAI token generation throughput was 13.699070483881153 tokens per second (tps), while the Llama.cpp operator achieved 22.75 tps. Both metrics are end-to-end.
Profiling showed that the MatMulNBits operator is the bottleneck for the model. Performance metrics from the onnxruntime profiling tool follow.

Past sequence length 29, Total sequence length 30

| Name | Duration | Pct | Count | Cumulative Pct | Cumulative Dur |
| --- | --- | --- | --- | --- | --- |
| MatMulNBits | 4132089 | 87.39 | 15826 | 87.39 | 4132089 |
| MultiHeadAttention | 289191 | 6.12 | 2624 | 93.50 | 4421280 |
| Add | 131205 | 2.77 | 16072 | 96.28 | 4552485 |
| FastGelu | 67065 | 1.42 | 2624 | 97.69 | 4619550 |

Past sequence length 128, Total sequence length 129

| Name | Duration | Pct | Count | Cumulative Pct | Cumulative Dur |
| --- | --- | --- | --- | --- | --- |
| MatMulNBits | 3882211 | 81.92 | 15440 | 81.92 | 3882211 |
| MultiHeadAttention | 576563 | 12.17 | 2560 | 94.08 | 4458774 |
| Add | 118635 | 2.50 | 15680 | 96.59 | 4577409 |
| FastGelu | 60107 | 1.27 | 2560 | 97.86 | 4637516 |

Past sequence length 512, Total sequence length 513

| Name | Duration | Pct | Count | Cumulative Pct | Cumulative Dur |
| --- | --- | --- | --- | --- | --- |
| MatMulNBits | 3054838 | 62.79 | 11773 | 62.79 | 3054838 |
| MultiHeadAttention | 1582324 | 32.53 | 1952 | 95.32 | 4637162 |
| Add | 98730 | 2.03 | 11956 | 97.35 | 4735892 |
| FastGelu | 48359 | 0.99 | 1952 | 98.34 | 4784251 |

This issue needs to be addressed to improve the performance of the Neural Speed Matmul operator and bring it up to par with the Llama.cpp operator.
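
As a rough cross-check of the profiles above, the sketch below divides each operator's total duration by its call count and computes the end-to-end throughput ratio. The figures are copied from this report; the variable and function names are illustrative, and the duration unit is whatever the profiler emits (it is not stated here).

```python
# (total duration, call count) per operator, keyed by past sequence length.
# Values are copied from the profiling output in this report.
profile = {
    29: {"MatMulNBits": (4132089, 15826), "MultiHeadAttention": (289191, 2624)},
    128: {"MatMulNBits": (3882211, 15440), "MultiHeadAttention": (576563, 2560)},
    512: {"MatMulNBits": (3054838, 11773), "MultiHeadAttention": (1582324, 1952)},
}


def per_call(profile):
    """Return the average duration per call for every operator and sequence length."""
    return {
        seq_len: {op: dur / count for op, (dur, count) in ops.items()}
        for seq_len, ops in profile.items()
    }


per_call_cost = per_call(profile)
for seq_len, ops in per_call_cost.items():
    for op, cost in ops.items():
        print(f"past_len={seq_len:4d}  {op:20s} {cost:8.1f} per call")

# End-to-end throughput figures quoted in this report (tokens per second).
genai_tps = 13.699070483881153
llama_cpp_tps = 22.75
ratio = llama_cpp_tps / genai_tps
print(f"llama.cpp / GenAI throughput ratio: {ratio:.2f}x")
```

On these numbers the per-call MatMulNBits duration stays roughly constant (about 251 to 261) across sequence lengths, while the per-call MultiHeadAttention duration grows from about 110 to about 811. That is consistent with the MatMulNBits share of total time falling as the sequence gets longer.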

@luoyu-intel
Contributor

Thanks for your report!
What's the accuracy level of this model's MatMulNBits?

@luoyu-intel luoyu-intel self-assigned this Mar 14, 2024
@yufenglee

> Thanks for your report! What's the accuracy level of this model's MatMulNBits?

We use fp32.

@luoyu-intel
Contributor

I will measure the performance with NeuralSpeed and llama.cpp. By the way, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?

@luoyu-intel
Contributor

I've run some tests on a 12900K. The latency results show that NeuralSpeed (weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp (phi-2.Q4_0.gguf).

> The GenAI token generation throughput was measured at 13.699070483881153 tokens per second (tps), while the Llama.cpp operator achieved a higher throughput of 22.75 tps. These metrics are end-to-end.

How did you measure the 13.699070483881153 tps? Can you provide steps to reproduce it?

@yufenglee

This is the tool to get the benchmark number: https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python
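
For readers without the benchmark at hand, here is a minimal sketch of how a token-generation throughput figure of this kind can be computed: time a fixed number of decode steps and divide. It is not the linked tool's code; `measure_tps`, `generate_token`, and the injectable `clock` are hypothetical names introduced only for illustration.

```python
import time


def measure_tps(generate_token, num_tokens, clock=time.perf_counter):
    """Time `num_tokens` calls to `generate_token` and return tokens per second.

    `generate_token` is a hypothetical zero-argument callable that produces one
    token; `clock` is injectable so the timing logic can be checked deterministically.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    start = clock()
    for _ in range(num_tokens):
        generate_token()
    elapsed = clock() - start
    return num_tokens / elapsed


# Deterministic example: a fake clock that reports 2 seconds for 10 tokens.
fake_times = iter([0.0, 2.0])
example_tps = measure_tps(lambda: None, 10, clock=lambda: next(fake_times))
print(example_tps)
```

An end-to-end figure also depends on what is counted (for example, whether prompt processing is included), so two tools are only comparable when they time the same phase.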

@yufenglee

> I will measure the performance with NeuralSpeed and llama.cpp. By the way, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?

The target machine doesn't have AVX_VNNI, and we tested int8+int4. The performance is similar to fp32+int4.

@luoyu-intel luoyu-intel added the enhancement New feature or request label Apr 2, 2024
@luoyu-intel
Contributor

We will plan this as a client target enhancement.

@luoyu-intel
Contributor

luoyu-intel commented Apr 9, 2024

This issue will be fixed in this PR: #209

@yufenglee

yufenglee commented Apr 9, 2024 via email

@luoyu-intel
Contributor

luoyu-intel commented Apr 17, 2024

@yufenglee For AVX2 devices without AVX_VNNI instructions, GGML uses _mm256_maddubs_epi16 as a replacement. But this instruction has an overflow risk: the result of int8 * int8 + int8 * int8 may exceed the int16 range, in which case it is saturated (clipped). Are you willing to accept this instruction as a replacement for AVX_VNNI, even though it could decrease accuracy?

For NBits lower than 8, it won't be a problem.
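
To make the overflow argument concrete, here is a pure-Python model of a single 16-bit output lane of _mm256_maddubs_epi16, following the instruction's documented behavior: it multiplies unsigned 8-bit values by signed 8-bit values, adds adjacent pairs of products, and saturates the sum to int16. This is only an arithmetic sketch, not the GGML or Neural Speed kernel; the helper names are made up for illustration, and it assumes the narrower operand is the signed one.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def maddubs_lane(a0, b0, a1, b1):
    """Model one output lane: a0*b0 + a1*b1 with signed 16-bit saturation.

    a0, a1 are unsigned 8-bit values; b0, b1 are signed 8-bit values.
    Returns (exact_sum, saturated_sum) so the two can be compared.
    """
    for a in (a0, a1):
        assert 0 <= a <= 255, "a operands must fit in uint8"
    for b in (b0, b1):
        assert -128 <= b <= 127, "b operands must fit in int8"
    exact = a0 * b0 + a1 * b1
    saturated = max(INT16_MIN, min(INT16_MAX, exact))
    return exact, saturated


def worst_case_sums(signed_bits):
    """Extreme pair sums when the signed operand is limited to `signed_bits` bits."""
    lo = -(1 << (signed_bits - 1))
    hi = (1 << (signed_bits - 1)) - 1
    return 2 * 255 * lo, 2 * 255 * hi


def fits_int16(signed_bits):
    """True if no pair sum can saturate for the given signed operand width."""
    lo_sum, hi_sum = worst_case_sums(signed_bits)
    return INT16_MIN <= lo_sum and hi_sum <= INT16_MAX


print(maddubs_lane(255, 127, 255, 127))  # full 8-bit operands: the sum saturates
print(worst_case_sums(4), fits_int16(4))  # 4-bit signed operand: always in range
print(fits_int16(8))
```

With full 8-bit operands the exact sum can reach 64770, well past the int16 maximum of 32767, so the lane saturates. Once the signed operand is limited to fewer than 8 bits, the worst-case pair sum stays inside the int16 range.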

@yufenglee

Since it won’t be an issue for bit widths below 8, it should be fine. We mainly use blockwise quantization for bit widths below 8.

@luoyu-intel
Contributor

According to this comment, this issue should have been fixed: #209 (comment)

@kevinintel
Contributor

@yufenglee
I will close this issue if you don't have any concerns.


4 participants