Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator #174
Comments
Thanks for your report!
We use the fp32 compute type.
I will measure the performance with NeuralSpeed and llama.cpp. By the way, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?
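For context, here is a minimal sketch of how an accuracy_level could be requested when quantizing a model to MatMulNBits with ONNX Runtime's 4-bit quantizer. The class and parameter names are assumptions based on the onnxruntime quantization tooling of that period and may differ across versions; accuracy_level=4 corresponds to int8 (COMP_INT8) compute.

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

# Load the fp32 model and quantize MatMul weights to 4 bits.
# accuracy_level=4 asks the MatMulNBits kernel to compute in int8
# (roughly what llama.cpp's AVX_VNNI path does); 1 keeps fp32 compute.
model = onnx.load("phi-2-fp32.onnx")  # hypothetical input path
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True, accuracy_level=4)
quantizer.process()
quantizer.model.save_model_to_file("phi-2-int4.onnx", use_external_data_format=True)
```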
I've done some tests on a 12900K. The latency results show that NeuralSpeed (weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp (phi-2.Q4_0.gguf).
How do you measure the 13.699070483881153 tps? Can you provide the steps to reproduce this number?
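For reference, a minimal sketch of the Neural Speed configuration being compared here, following the neural-speed Python API. The model name and the exact init arguments are assumptions and may vary by version.

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "microsoft/phi-2"  # assumed model id for the Phi-2 comparison
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# int4 weights, int8 compute, group size 32 -- the setup measured above.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8", group_size=32)
outputs = model.generate(inputs, max_new_tokens=32)
```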
This is the tool to get the benchmark number: https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python
The target machine doesn't have AVX_VNNI, and we tested int8+int4. The performance is similar to fp32+int4.
We will plan it as a client-target enhancement.
This issue will be fixed in this PR: #209
@yufenglee For AVX2 devices without AVX_VNNI instructions, GGML uses _mm256_maddubs_epi16 as a replacement. But this instruction has an overflow risk: the result of int8*int8 + int8*int8 may exceed the maximum value of int16, in which case the result is clipped. Are you willing to accept this instruction as a replacement for AVX_VNNI, even though it could decrease accuracy? For NBits lower than 8, it won't be a problem.
As it won't be an issue for bit widths lower than 8, it should be fine. We mainly use blockwise quantization for bit widths lower than 8.
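To make the saturation concern concrete, here is a small numerical sketch (not the intrinsic itself) emulating what _mm256_maddubs_epi16 does per 16-bit lane: it multiplies unsigned 8-bit values by signed 8-bit values, adds adjacent pairs, and saturates the sum to int16. The values below are illustrative worst cases, not measurements from the thread.

```python
import numpy as np

def maddubs_epi16_pair(u8_pair, s8_pair):
    """Emulate one 16-bit lane of VPMADDUBSW: u8*s8 + u8*s8 with signed saturation."""
    full = int(u8_pair[0]) * int(s8_pair[0]) + int(u8_pair[1]) * int(s8_pair[1])
    return int(np.clip(full, -32768, 32767)), full

# Worst case for 8-bit data: 255*127 + 255*127 = 64770 > INT16_MAX, so the lane clips.
print(maddubs_epi16_pair((255, 255), (127, 127)))   # (32767, 64770) -> accuracy loss

# With 4-bit weights (0..15) against int8 activations the pair sum stays within int16,
# which is why the overflow is not a concern for NBits lower than 8.
print(maddubs_epi16_pair((15, 15), (-128, -128)))   # (-3840, -3840) -> no clipping
```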
According to this comment, this issue should have been fixed: #209 (comment)
@yufenglee
I’ve discovered a performance gap between the Neural Speed Matmul operator and the Llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI tool.
The ONNXRuntime-GenAI tool was used to run a CPU-based int4 version of Phi-2 that utilizes the MatmulNBits operator. Its performance was then compared with the metrics from the Llama.cpp operator.
The GenAI token generation throughput was measured at 13.699070483881153 tokens per second (tps), while the Llama.cpp operator achieved a higher throughput of 22.75 tps. Both metrics are end-to-end.
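For reference, a minimal sketch of how the token-generation throughput can be measured with the onnxruntime-genai Python API; the linked benchmark tool is the actual source of the numbers above. The model path is hypothetical and the API names reflect the package at the time, so treat them as assumptions.

```python
import time
import onnxruntime_genai as og

model = og.Model("phi-2-int4-cpu")  # hypothetical path to the int4 Phi-2 model directory
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("Tell me a story about a little girl.")

# Greedy decode loop; tps here counts generated tokens over wall-clock decode time.
generator = og.Generator(model, params)
generated = 0
start = time.perf_counter()
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start
print(f"token generation throughput: {generated / elapsed:.2f} tps")
```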
Upon profiling the MatmulNBits operator, it was identified as the bottleneck of the model. Some performance metrics acquired with the onnxruntime profiling tool are included below for further analysis.
Past sequence length 29, Total sequence length 30
Past sequence length 128, Total sequence length 129
Past sequence length 512, Total sequence length 513
This issue needs to be addressed to improve the performance of the Neural Speed Matmul operator and bring it up to par with the Llama.cpp operator.