TurboMind Benchmark on A100

All results below were measured on a node with 8x A100-80G GPUs and CUDA 11.8.

The lmdeploy version tested is v0.2.0.

Request Throughput Benchmark

  • batch: the max batch size during inference
  • tp: the number of GPU cards for tensor parallelism
  • num_prompts: the number of prompts, i.e. the number of requests
  • RPS: Requests Per Second
  • FTL: First Token Latency
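The metrics above can be illustrated with a minimal sketch. It assumes each request is recorded with its prompt token count, output token count, and first token latency, and that throughput is computed over the wall-clock time of the whole run. The names `RequestRecord` and `summarize` are hypothetical and are not part of lmdeploy; the benchmark script may compute these values differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One finished request (hypothetical record layout, for illustration only)."""
    prompt_tokens: int          # number of input tokens
    output_tokens: int          # number of generated tokens
    first_token_latency: float  # seconds from submission to first token


def summarize(records, elapsed_s):
    """Aggregate per-request records into the columns used in the tables below.

    Assumption: all rates are taken over the total wall-clock time `elapsed_s`.
    """
    out_tokens = sum(r.output_tokens for r in records)
    in_tokens = sum(r.prompt_tokens for r in records)
    ftl = [r.first_token_latency for r in records]
    return {
        "rps": len(records) / elapsed_s,
        "out_tok_per_s": out_tokens / elapsed_s,
        "total_tok_per_s": (in_tokens + out_tokens) / elapsed_s,
        "ftl_ave": mean(ftl),
        "ftl_min": min(ftl),
        "ftl_max": max(ftl),
    }


if __name__ == "__main__":
    demo = [RequestRecord(100, 50, 0.1), RequestRecord(200, 150, 0.3)]
    print(summarize(demo, elapsed_s=2.0))
```

Under this assumption, total throughput counts both prompt and generated tokens, which is why it is always at least as large as the output throughput.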

FP16

| model | batch | tp | num_prompts | RPS | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) | throughput(out tok/s) | throughput(total tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama2-7b | 256 | 1 | 3000 | 14.556 | 0.526 | 0.092 | 4.652 | 0.066 | 0.101 | 0.155 | 0.220 | 3387.419 | 6981.159 |
| llama2-13b | 128 | 1 | 3000 | 7.950 | 0.352 | 0.075 | 4.193 | 0.051 | 0.067 | 0.138 | 0.202 | 1850.145 | 3812.978 |
| internlm-20b | 128 | 2 | 3000 | 10.291 | 0.287 | 0.073 | 3.845 | 0.053 | 0.072 | 0.113 | 0.161 | 2053.266 | 4345.057 |
| llama2-70b | 256 | 4 | 3000 | 7.231 | 1.075 | 0.139 | 14.524 | 0.102 | 0.153 | 0.292 | 0.482 | 1682.738 | 3467.969 |

W4A16

KV8

api_server Benchmark

FP16

W4A16

KV8

Static Inference Benchmark

  • batch: the max batch size during inference
  • tp: the number of GPU cards for tensor parallelism
  • prompt_tokens: the number of input tokens
  • output_tokens: the number of generated tokens
  • throughput: the number of generated tokens per second
  • FTL: First Token Latency
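As a rough illustration of how the static inference columns could be derived, the sketch below assumes that throughput is the number of generated tokens across the whole batch divided by the wall-clock time, and that the percentile columns are nearest-rank percentiles over a list of latency samples. The source does not state either formula or what the percentile samples are, so both are assumptions, and the function names `static_throughput` and `percentile` are hypothetical.

```python
import math


def static_throughput(batch, output_tokens, elapsed_s):
    """Generated tokens per second for one static run.

    Assumption: every sequence in the batch generates `output_tokens` tokens.
    """
    return batch * output_tokens / elapsed_s


def percentile(samples, pct):
    """Nearest-rank percentile of `samples` for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    print(static_throughput(batch=16, output_tokens=128, elapsed_s=2.0))
    latencies = [0.04, 0.01, 0.03, 0.02]  # made-up samples, in seconds
    for p in (50, 75, 95, 99):
        print(p, percentile(latencies, p))
```

Nearest-rank is only one of several percentile conventions; an interpolating method would give slightly different values on small samples.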

FP16

| batch | tp | prompt_tokens | output_tokens | throughput(out tok/s) | mem(GB) | FTL(ave)(s) | FTL(min)(s) | FTL(max)(s) | 50%(s) | 75%(s) | 95%(s) | 99%(s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 128 | 100.02 | 76.55 | 0.011 | 0.01 | 0.011 | 0.009 | 0.009 | 0.01 | 0.011 |
| 1 | 1 | 128 | 128 | 102.21 | 76.59 | 0.022 | 0.022 | 0.022 | 0.01 | 0.01 | 0.01 | 0.01 |
| 1 | 1 | 128 | 2048 | 98.92 | 76.59 | 0.022 | 0.022 | 0.022 | 0.01 | 0.01 | 0.01 | 0.01 |
| 1 | 1 | 2048 | 128 | 86.1 | 76.77 | 0.139 | 0.139 | 0.14 | 0.01 | 0.01 | 0.01 | 0.011 |
| 1 | 1 | 2048 | 2048 | 93.78 | 76.77 | 0.14 | 0.139 | 0.141 | 0.011 | 0.011 | 0.011 | 0.011 |
| 16 | 1 | 1 | 128 | 1504.72 | 76.59 | 0.021 | 0.011 | 0.031 | 0.01 | 0.011 | 0.011 | 0.013 |
| 16 | 1 | 128 | 128 | 1272.47 | 76.77 | 0.129 | 0.023 | 0.149 | 0.011 | 0.011 | 0.012 | 0.014 |
| 16 | 1 | 128 | 2048 | 1010.62 | 76.77 | 0.13 | 0.023 | 0.144 | 0.015 | 0.018 | 0.02 | 0.021 |
| 16 | 1 | 2048 | 128 | 348.87 | 78.3 | 2.897 | 0.143 | 3.576 | 0.02 | 0.021 | 0.022 | 0.025 |
| 16 | 1 | 2048 | 2048 | 601.63 | 78.3 | 2.678 | 0.142 | 3.084 | 0.025 | 0.028 | 0.03 | 0.031 |
| 32 | 1 | 1 | 128 | 2136.73 | 76.62 | 0.079 | 0.014 | 0.725 | 0.011 | 0.012 | 0.013 | 0.021 |
| 32 | 1 | 128 | 128 | 2125.47 | 76.99 | 0.214 | 0.022 | 0.359 | 0.012 | 0.013 | 0.014 | 0.035 |
| 32 | 1 | 128 | 2048 | 1462.12 | 76.99 | 0.2 | 0.026 | 0.269 | 0.021 | 0.026 | 0.031 | 0.033 |
| 32 | 1 | 2048 | 128 | 450.43 | 78.3 | 4.288 | 0.143 | 5.267 | 0.031 | 0.032 | 0.034 | 0.161 |
| 32 | 1 | 2048 | 2048 | 733.34 | 78.34 | 4.118 | 0.19 | 5.429 | 0.04 | 0.045 | 0.05 | 0.053 |
| 64 | 1 | 1 | 128 | 4154.81 | 76.71 | 0.042 | 0.013 | 0.21 | 0.012 | 0.018 | 0.028 | 0.041 |
| 64 | 1 | 128 | 128 | 3024.07 | 77.43 | 0.44 | 0.026 | 1.061 | 0.014 | 0.018 | 0.026 | 0.158 |
| 64 | 1 | 128 | 2048 | 1852.06 | 77.96 | 0.535 | 0.027 | 1.231 | 0.03 | 0.041 | 0.048 | 0.053 |
| 64 | 1 | 2048 | 128 | 493.46 | 78.4 | 6.59 | 0.142 | 16.235 | 0.046 | 0.049 | 0.055 | 0.767 |
| 64 | 1 | 2048 | 2048 | 755.65 | 78.4 | 39.105 | 0.142 | 116.285 | 0.047 | 0.049 | 0.051 | 0.207 |