TurboMind Benchmark on A100

All results below were measured on a node with 8x A100-80G GPUs and CUDA 11.8.

The tested lmdeploy version is v0.2.0.

Request Throughput Benchmark

batch: the max batch size during inference
tp: the number of GPU cards for tensor parallelism
num_prompts: the number of prompts, i.e. the number of requests
RPS: Requests Per Second
FTL: First Token Latency
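
As a rough illustration of how these metrics relate to each other, the sketch below aggregates per-request measurements into RPS, FTL statistics, and token throughput. The `RequestRecord` type, its fields, and the `summarize` helper are hypothetical names introduced for this example and are not part of lmdeploy. It also assumes that total throughput counts prompt plus output tokens; the actual benchmark script may compute these values differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """Hypothetical per-request measurement (not an lmdeploy type)."""
    first_token_latency: float  # seconds from request start to first token
    prompt_tokens: int
    output_tokens: int


def summarize(records, elapsed_s):
    """Aggregate records collected over a wall-clock window of elapsed_s seconds."""
    ftl = [r.first_token_latency for r in records]
    out_tokens = sum(r.output_tokens for r in records)
    # Assumption: "total" throughput counts prompt tokens plus output tokens.
    all_tokens = out_tokens + sum(r.prompt_tokens for r in records)
    return {
        "num_prompts": len(records),
        "RPS": len(records) / elapsed_s,
        "FTL(ave)": mean(ftl),
        "FTL(min)": min(ftl),
        "FTL(max)": max(ftl),
        "throughput(out tok/s)": out_tokens / elapsed_s,
        "throughput(total tok/s)": all_tokens / elapsed_s,
    }


# Two made-up requests that finished within a 2-second window.
records = [
    RequestRecord(first_token_latency=0.10, prompt_tokens=200, output_tokens=100),
    RequestRecord(first_token_latency=0.30, prompt_tokens=300, output_tokens=200),
]
stats = summarize(records, elapsed_s=2.0)
for name, value in stats.items():
    print(f"{name}: {value}")
```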

FP16

model	batch	tp	num_prompts	RPS	FTL(ave)(s)	FTL(min)(s)	FTL(max)(s)	50%(s)	75%(s)	95%(s)	99%(s)	throughput(out tok/s)	throughput(total tok/s)
llama2-7b	256	1	3000	14.556	0.526	0.092	4.652	0.066	0.101	0.155	0.220	3387.419	6981.159
llama2-13b	128	1	3000	7.950	0.352	0.075	4.193	0.051	0.067	0.138	0.202	1850.145	3812.978
internlm-20b	128	2	3000	10.291	0.287	0.073	3.845	0.053	0.072	0.113	0.161	2053.266	4345.057
llama2-70b	256	4	3000	7.231	1.075	0.139	14.524	0.102	0.153	0.292	0.482	1682.738	3467.969
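
Because RPS and token throughput are measured over the same run, dividing one by the other gives a rough average token count per request. The sketch below does this for the FP16 rows above; `tokens_per_request` is an illustrative helper, and the result is only an estimate, assuming both columns share the same elapsed time.

```python
# (model, RPS, out tok/s, total tok/s), copied from the FP16 table above
rows = [
    ("llama2-7b", 14.556, 3387.419, 6981.159),
    ("llama2-13b", 7.950, 1850.145, 3812.978),
    ("internlm-20b", 10.291, 2053.266, 4345.057),
    ("llama2-70b", 7.231, 1682.738, 3467.969),
]


def tokens_per_request(rps, tok_per_s):
    """Estimate average tokens per request from request rate and token rate."""
    return tok_per_s / rps


for model, rps, out_tps, total_tps in rows:
    out_avg = tokens_per_request(rps, out_tps)
    total_avg = tokens_per_request(rps, total_tps)
    print(f"{model}: ~{out_avg:.1f} output tokens, ~{total_avg:.1f} total tokens per request")
```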

W4A16

KV8

api_server Benchmark

FP16

W4A16

KV8

Static Inference Benchmark

batch: the max batch size during inference
tp: the number of GPU cards for tensor parallelism
prompt_tokens: the number of input tokens
output_tokens: the number of generated tokens
throughput: the number of generated tokens per second
FTL: First Token Latency

FP16

batch	tp	prompt_tokens	output_tokens	throughput(out tok/s)	mem(GB)	FTL(ave)(s)	FTL(min)(s)	FTL(max)(s)	50%(s)	75%(s)	95%(s)	99%(s)
1	1	1	128	100.02	76.55	0.011	0.01	0.011	0.009	0.009	0.01	0.011
1	1	128	128	102.21	76.59	0.022	0.022	0.022	0.01	0.01	0.01	0.01
1	1	128	2048	98.92	76.59	0.022	0.022	0.022	0.01	0.01	0.01	0.01
1	1	2048	128	86.1	76.77	0.139	0.139	0.14	0.01	0.01	0.01	0.011
1	1	2048	2048	93.78	76.77	0.14	0.139	0.141	0.011	0.011	0.011	0.011
16	1	1	128	1504.72	76.59	0.021	0.011	0.031	0.01	0.011	0.011	0.013
16	1	128	128	1272.47	76.77	0.129	0.023	0.149	0.011	0.011	0.012	0.014
16	1	128	2048	1010.62	76.77	0.13	0.023	0.144	0.015	0.018	0.02	0.021
16	1	2048	128	348.87	78.3	2.897	0.143	3.576	0.02	0.021	0.022	0.025
16	1	2048	2048	601.63	78.3	2.678	0.142	3.084	0.025	0.028	0.03	0.031
32	1	1	128	2136.73	76.62	0.079	0.014	0.725	0.011	0.012	0.013	0.021
32	1	128	128	2125.47	76.99	0.214	0.022	0.359	0.012	0.013	0.014	0.035
32	1	128	2048	1462.12	76.99	0.2	0.026	0.269	0.021	0.026	0.031	0.033
32	1	2048	128	450.43	78.3	4.288	0.143	5.267	0.031	0.032	0.034	0.161
32	1	2048	2048	733.34	78.34	4.118	0.19	5.429	0.04	0.045	0.05	0.053
64	1	1	128	4154.81	76.71	0.042	0.013	0.21	0.012	0.018	0.028	0.041
64	1	128	128	3024.07	77.43	0.44	0.026	1.061	0.014	0.018	0.026	0.158
64	1	128	2048	1852.06	77.96	0.535	0.027	1.231	0.03	0.041	0.048	0.053
64	1	2048	128	493.46	78.4	6.59	0.142	16.235	0.046	0.049	0.055	0.767
64	1	2048	2048	755.65	78.4	39.105	0.142	116.285	0.047	0.049	0.051	0.207
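
One way to read the table is to compare how throughput scales with batch size. The sketch below takes the rows with prompt_tokens=1 and output_tokens=128 and prints the per-sequence generation rate and the speedup over batch 1. The helper name is illustrative, and the calculation assumes the throughput column aggregates generated tokens over all sequences in the batch.

```python
# (batch, prompt_tokens, output_tokens, throughput in out tok/s), copied from the table above
rows = [
    (1, 1, 128, 100.02),
    (16, 1, 128, 1504.72),
    (32, 1, 128, 2136.73),
    (64, 1, 128, 4154.81),
]


def per_sequence_rate(batch, throughput):
    """Tokens per second seen by a single sequence, assuming an even split across the batch."""
    return throughput / batch


baseline = rows[0][3]  # throughput at batch size 1
for batch, prompt_tokens, output_tokens, throughput in rows:
    rate = per_sequence_rate(batch, throughput)
    speedup = throughput / baseline
    print(f"batch={batch}: {rate:.2f} tok/s per sequence, {speedup:.1f}x the batch-1 throughput")
```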
