Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator #174
Comments
Thanks for your report!
We use the fp32 compute type.
I will measure the performance with NeuralSpeed and llama.cpp. By the way, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?
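For context, here is a minimal sketch of how an accuracy_level could be requested when quantizing a model to MatMulNBits with ONNX Runtime's 4-bit quantizer. The class and parameter names are assumptions based on the onnxruntime quantization tooling of that period and may differ across versions; accuracy_level=4 corresponds to int8 (COMP_INT8) compute.

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Load the fp32 model and quantize MatMul weights to 4 bits.
# accuracy_level=4 asks the MatMulNBits kernel to compute in int8
# (roughly what llama.cpp's AVX_VNNI path does); 1 keeps fp32 compute.
model = onnx.load("phi-2-fp32.onnx")  # hypothetical input path
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True, accuracy_level=4)
quantizer.process()
quantizer.model.save_model_to_file("phi-2-int4.onnx", use_external_data_format=True)
```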
I've done some tests on a 12900K. The latency results show that NeuralSpeed (weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp (phi-2.Q4_0.gguf).
How do you measure the 13.699070483881153 tps? Can you provide the steps to reproduce this number?
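For reference, a minimal sketch of the Neural Speed configuration being compared here, following the neural-speed Python API. The model name and the exact init arguments are assumptions and may vary by version.

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "microsoft/phi-2"  # assumed model id for the Phi-2 comparison
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# int4 weights, int8 compute, group size 32 -- the setup measured above.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8", group_size=32)
outputs = model.generate(inputs, max_new_tokens=32)
```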
This is the tool to get the benchmark number: https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python
The target machine doesn't have AVX_VNNI, and we tested int8+int4. The performance is similar to fp32+int4.
We will plan it as a client-target enhancement.
This issue will be fixed in this PR: #209
@yufenglee For AVX2 devices without AVX_VNNI instructions, GGML uses _mm256_maddubs_epi16 as a replacement. But this instruction has an overflow risk: the result of int8*int8 + int8*int8 may exceed the maximum value of int16, in which case the result is clipped. Are you willing to accept this instruction as a replacement for AVX_VNNI, even though it could decrease accuracy? For NBits lower than 8, it won't be a problem.
As it won't be an issue for bit widths lower than 8, it should be fine. We mainly use blockwise quantization for bit widths lower than 8.
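To make the saturation concern concrete, here is a small numerical sketch (not the intrinsic itself) emulating what _mm256_maddubs_epi16 does per 16-bit lane: it multiplies unsigned 8-bit values by signed 8-bit values, adds adjacent pairs, and saturates the sum to int16. The values below are illustrative worst cases, not measurements from the thread.

```python
import numpy as np

def maddubs_epi16_pair(u8_pair, s8_pair):
    """Emulate one 16-bit lane of VPMADDUBSW: u8*s8 + u8*s8 with signed saturation."""
    full = int(u8_pair[0]) * int(s8_pair[0]) + int(u8_pair[1]) * int(s8_pair[1])
    return int(np.clip(full, -32768, 32767)), full

# Worst case for 8-bit data: 255*127 + 255*127 = 64770 > INT16_MAX, so the lane clips.
print(maddubs_epi16_pair((255, 255), (127, 127)))   # (32767, 64770) -> accuracy loss

# With 4-bit weights (0..15) against int8 activations the pair sum stays within int16,
# which is why the overflow is not a concern for NBits lower than 8.
print(maddubs_epi16_pair((15, 15), (-128, -128)))   # (-3840, -3840) -> no clipping
```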
According to this comment, this issue should have been fixed: #209 (comment)
@yufenglee
I’ve discovered a performance gap between the Neural Speed Matmul operator and the Llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI tool.
The ONNXRuntime-GenAI tool was used to run a CPU-based int4 version of Phi-2 that utilizes the MatmulNBits operator. Its performance was then compared with the metrics from the Llama.cpp operator.
The GenAI token generation throughput was measured at 13.699070483881153 tokens per second (tps), while the Llama.cpp operator achieved a higher throughput of 22.75 tps. Both metrics are end-to-end.
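For reference, a minimal sketch of how the token-generation throughput can be measured with the onnxruntime-genai Python API; the linked benchmark tool is the actual source of the numbers above. The model path is hypothetical and the API names reflect the package at the time, so treat them as assumptions.

```python
import time
import onnxruntime_genai as og

model = og.Model("phi-2-int4-cpu")  # hypothetical path to the int4 Phi-2 model directory
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Tell me a story about a little girl.")

# Greedy decode loop; tps here counts generated tokens over wall-clock decode time.
generator = og.Generator(model, params)
generated = 0
start = time.perf_counter()
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start
print(f"token generation throughput: {generated / elapsed:.2f} tps")
```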
Upon profiling the MatmulNBits operator, it was identified as the bottleneck of the model. Some performance metrics acquired with the onnxruntime profiling tool are included below for further analysis.
Past sequence length 29, Total sequence length 30
Past sequence length 128, Total sequence length 129
Past sequence length 512, Total sequence length 513
This issue needs to be addressed to improve the performance of the Neural Speed Matmul operator and bring it up to par with the Llama.cpp operator.