Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator #174
Description
I’ve discovered a performance gap between the Neural Speed Matmul operator and the Llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI tool.
The ONNXRuntime-GenAI tool was used to run a CPU-based int4 version of Phi-2 that utilizes the MatmulNBits operator; its performance was then compared against the Llama.cpp operator.
The GenAI token generation throughput was measured at 13.70 tokens per second (tps), while the Llama.cpp operator achieved a higher throughput of 22.75 tps. Both figures are end-to-end.
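To put the gap in concrete terms, a minimal sketch using only the end-to-end throughput figures reported above:

```python
# End-to-end generation throughput figures reported in this issue
genai_tps = 13.70      # ONNXRuntime-GenAI with MatmulNBits (Neural Speed)
llama_cpp_tps = 22.75  # Llama.cpp

# Relative slowdown of the GenAI path
slowdown = llama_cpp_tps / genai_tps
print(f"Llama.cpp is {slowdown:.2f}x faster end to end")  # → 1.66x
```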
Profiling identified the MatMulNBits operator as the bottleneck for the model. The performance metrics below were acquired with the ONNX Runtime profiling tool for further analysis.
Past sequence length 29, Total sequence length 30
Name | Duration (µs) | Pct (%) | Count | Cumulative Pct (%) | Cumulative Dur (µs)
---|---|---|---|---|---
MatMulNBits | 4132089 | 87.39 | 15826 | 87.39 | 4132089 |
MultiHeadAttention | 289191 | 6.12 | 2624 | 93.50 | 4421280 |
Add | 131205 | 2.77 | 16072 | 96.28 | 4552485 |
FastGelu | 67065 | 1.42 | 2624 | 97.69 | 4619550 |
Past sequence length 128, Total sequence length 129
Name | Duration (µs) | Pct (%) | Count | Cumulative Pct (%) | Cumulative Dur (µs)
---|---|---|---|---|---
MatMulNBits | 3882211 | 81.92 | 15440 | 81.92 | 3882211 |
MultiHeadAttention | 576563 | 12.17 | 2560 | 94.08 | 4458774 |
Add | 118635 | 2.50 | 15680 | 96.59 | 4577409 |
FastGelu | 60107 | 1.27 | 2560 | 97.86 | 4637516 |
Past sequence length 512, Total sequence length 513
Name | Duration (µs) | Pct (%) | Count | Cumulative Pct (%) | Cumulative Dur (µs)
---|---|---|---|---|---
MatMulNBits | 3054838 | 62.79 | 11773 | 62.79 | 3054838 |
MultiHeadAttention | 1582324 | 32.53 | 1952 | 95.32 | 4637162 |
Add | 98730 | 2.03 | 11956 | 97.35 | 4735892 |
FastGelu | 48359 | 0.99 | 1952 | 98.34 | 4784251 |
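For anyone who wants to reproduce tables like the ones above, the sketch below aggregates per-operator kernel times from an ONNX Runtime profile JSON (chrome-trace format, durations in microseconds). The event-filtering details (`cat == "Node"`, names ending in `_kernel_time`, op type in `args["op_name"]`) reflect my reading of the ORT profiler output and may need adjusting for other ORT versions:

```python
import json
from collections import defaultdict

def summarize_profile(path):
    """Aggregate per-op-type durations from an ONNX Runtime profile JSON.

    Assumes chrome-trace format: kernel events have cat == "Node",
    a name ending in "_kernel_time", and the op type in args["op_name"];
    durations are in microseconds.
    """
    with open(path) as f:
        data = json.load(f)
    # Some trace files wrap the event list in {"traceEvents": [...]}
    events = data.get("traceEvents", []) if isinstance(data, dict) else data

    totals = defaultdict(lambda: [0, 0])  # op_type -> [total_dur_us, count]
    for ev in events:
        if ev.get("cat") == "Node" and ev.get("name", "").endswith("_kernel_time"):
            op = ev.get("args", {}).get("op_name", "unknown")
            totals[op][0] += ev.get("dur", 0)
            totals[op][1] += 1

    # Sort by total duration, descending, and print a table like the ones above
    rows = sorted(((op, d, c) for op, (d, c) in totals.items()),
                  key=lambda r: r[1], reverse=True)
    grand = sum(r[1] for r in rows) or 1
    cum = 0
    print("Name | Duration | Pct | Count | Cumulative Pct | Cumulative Dur")
    for op, dur, count in rows:
        cum += dur
        print(f"{op} | {dur} | {100 * dur / grand:.2f} | {count} "
              f"| {100 * cum / grand:.2f} | {cum}")
    return rows
```

Running this against the profile file produced by a session with `enable_profiling` set should reproduce the Duration/Pct/Count/Cumulative columns shown above.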
This gap needs to be addressed to bring the Neural Speed Matmul operator's performance up to par with the Llama.cpp operator.