
[Performance issue] Unable to Reproduce the Throughput of Llama 3.1 8B FP8 on H100 #6294

@skymizerFuji

Description


Hi, I attempted to reproduce the Total Output Throughput (tokens/sec) reported here:
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.20.0/docs/source/performance/perf-overview.md#llama-31-8b-fp8

For ISL/OSL 128/128, the expected throughput is 27,688.36 tokens/sec, but I only achieved 7,099.53 tokens/sec.

Am I missing something?
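
For reference, the perf-overview guide generates the ISL/OSL 128/128 dataset with prepare_dataset.py roughly as follows (this is my reading of the guide; the script path and flag names may differ between releases):

# dataset generation as described in the perf-overview guide (flags may vary by version)
python benchmarks/cpp/prepare_dataset.py --tokenizer nvidia/Llama-3.1-8B-Instruct-FP8 --stdout token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 30000 > 128_128_30000.jsonl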

Environment:
Container launched on Runpod
1 x H100 PCIe (80 GB VRAM)
16 vCPU, 251 GB RAM
tensorrt_llm==0.20.0

extra-llm-api-config.yml:

cuda_graph_config:
  batch_sizes:
  - 128
enable_attention_dp: true

How I installed tensorrt_llm:

apt-get -y install libopenmpi-dev
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
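
For reproducibility, I believe pinning the exact release is equivalent, e.g.:

# same install, but with the version pinned to match the perf-overview page above
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm==0.20.0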

Full output log:

root@7f4c828c65a8:/workspace/trtllm-benchmark# trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 throughput --dataset /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl --backend pytorch --extra_llm_api_options extra-llm-api-config.yml --kv_cache_free_gpu_mem_fraction 0.8
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-07-23 09:13:59,432 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[07/23/2025-09:14:01] [TRT-LLM] [I] Preparing to run throughput benchmark...
Parse safetensors files: 100%|█████████████████████████████████████| 2/2 [00:04<00:00,  2.26s/it]
[07/23/2025-09:14:15] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl
Number of Sequences:  30000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000            28.0000           156.0000
MAX:   128.0000            28.0000           156.0000
AVG:   128.0000            28.0000           156.0000
P50:   128.0000            28.0000           156.0000
P90:   128.0000            28.0000           156.0000
P95:   128.0000            28.0000           156.0000
P99:   128.0000            28.0000           156.0000
===========================================================

Fetching 12 files: 100%|████████████████████████████████████████| 12/12 [00:00<00:00, 208.34it/s]
Parse safetensors files: 100%|█████████████████████████████████████| 2/2 [00:00<00:00,  8.31it/s]
[07/23/2025-09:14:16] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:14:16] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated engine size: 7.48 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated total available memory for KV cache: 72.17 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated total KV cache memory: 68.56 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 7200.59
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 4096
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 23552
[07/23/2025-09:14:16] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristics or pre-defined settings: max_batch_size=4096, max_num_tokens=23552.
[07/23/2025-09:14:16] [TRT-LLM] [I] Setting PyTorch max sequence length to 156
[07/23/2025-09:14:16] [TRT-LLM] [I] Setting up throughput benchmark.
[07/23/2025-09:14:16] [TRT-LLM] [W] Using default gpus_per_node: 1
[07/23/2025-09:14:16] [TRT-LLM] [I] Set nccl_plugin to None.
[07/23/2025-09:14:17] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048, 4096], cuda_graph_max_batch_size=4096, cuda_graph_padding_enabled=True, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
2025-07-23 09:14:27,490 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[07/23/2025-09:14:30] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:14:30] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 uses 8.46 GB for model weights.
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors to memory...
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors to memory...
[07/23/2025-09:15:43] [TRT-LLM] [I] Rank 0 finished prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors.
[07/23/2025-09:15:56] [TRT-LLM] [I] Rank 0 finished prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors.
[07/23/2025-09:15:56] [TRT-LLM] [I] Loading /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors
[07/23/2025-09:15:56] [TRT-LLM] [I] Loading /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors
Loading weights: 100%|██████████| 617/617 [00:02<00:00, 256.10it/s]
Model init total -- 88.37s
[07/23/2025-09:15:59] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.800000011920929 and 28384, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5 [window size=157]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.73 GiB for max tokens in paged KV cache (28384).
[07/23/2025-09:15:59] [TRT-LLM] [I] max_seq_len=157, max_num_requests=4096, max_num_tokens=23552
[07/23/2025-09:15:59] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[07/23/2025-09:15:59] [TRT-LLM] [I] Run autotuning warmup for batch size=1
2025-07-23 09:15:59,759 - INFO - flashinfer.jit: Loading JIT ops: norm
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-07-23 09:15:59,903 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-07-23 09:16:00,896 - INFO - flashinfer.jit: Loading JIT ops: silu_and_mul
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-07-23 09:16:01,321 - INFO - flashinfer.jit: Finished loading JIT ops: silu_and_mul
[07/23/2025-09:16:01] [TRT-LLM] [I] Autotuner Cache size after warmup 0
[07/23/2025-09:16:01] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[07/23/2025-09:16:01] [TRT-LLM] [I] Creating CUDA graph instances for 24 batch sizes.
[07/23/2025-09:16:02] [TRT-LLM] [I] Memory used after loading model weights (inside torch) in memory usage profiling: 8.55 GiB
[07/23/2025-09:16:02] [TRT-LLM] [I] Memory used after loading model weights (outside torch) in memory usage profiling: 3.14 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Memory dynamically allocated during inference (inside torch) in memory usage profiling: 2.95 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Memory used outside torch (e.g., NCCL and CUDA graphs) in memory usage profiling: 3.78 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Peak memory during memory usage profiling (torch + non-torch): 15.29 GiB, available KV cache memory when calculating max tokens: 52.51 GiB, fraction is set 0.800000011920929, kv size is 65536, device total memory 79.19 GiB, , tmp kv_mem 1.73 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Estimated max tokens in KV cache : 860308
[07/23/2025-09:16:03] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.800000011920929 and 860308, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5 [window size=157]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 52.51 GiB for max tokens in paged KV cache (860320).
[07/23/2025-09:16:03] [TRT-LLM] [I] max_seq_len=157, max_num_requests=4096, max_num_tokens=23552
[07/23/2025-09:16:03] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[07/23/2025-09:16:03] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[07/23/2025-09:16:03] [TRT-LLM] [I] Autotuner Cache size after warmup 0
[07/23/2025-09:16:03] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[07/23/2025-09:16:03] [TRT-LLM] [I] Creating CUDA graph instances for 24 batch sizes.
[07/23/2025-09:16:04] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=4096
[07/23/2025-09:16:05] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=2048
[07/23/2025-09:16:05] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=1024
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=512
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=256
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=128
[07/23/2025-09:16:07] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=120
[07/23/2025-09:16:07] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=112
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=104
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=96
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=88
[07/23/2025-09:16:09] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=80
[07/23/2025-09:16:09] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=72
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=64
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=56
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=48
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=40
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=32
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=24
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=16
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=8
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=4
[07/23/2025-09:16:13] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=2
[07/23/2025-09:16:13] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=1
[07/23/2025-09:16:13] [TRT-LLM] [I] Setting up for warmup...
[07/23/2025-09:16:13] [TRT-LLM] [I] Running warmup.
[07/23/2025-09:16:13] [TRT-LLM] [I] Starting benchmarking async task.
[07/23/2025-09:16:13] [TRT-LLM] [I] Starting benchmark...
[07/23/2025-09:16:13] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0001s, rate=30122.30 req/s]
[07/23/2025-09:16:14] [TRT-LLM] [I] Benchmark complete.
[07/23/2025-09:16:14] [TRT-LLM] [I] Stopping LLM backend.
[07/23/2025-09:16:14] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[07/23/2025-09:16:14] [TRT-LLM] [I] All tasks cancelled.
[07/23/2025-09:16:14] [TRT-LLM] [I] LLM Backend stopped.
[07/23/2025-09:16:14] [TRT-LLM] [I] Worker task cancelled.
[07/23/2025-09:16:14] [TRT-LLM] [I] Warmup done.
[07/23/2025-09:16:14] [TRT-LLM] [I] No log path provided, skipping logging.
[07/23/2025-09:16:14] [TRT-LLM] [I] Starting benchmarking async task.
[07/23/2025-09:16:14] [TRT-LLM] [I] Starting benchmark...
[07/23/2025-09:16:14] [TRT-LLM] [I] Request submission complete. [count=30000, time=0.1026s, rate=292499.08 req/s]
[07/23/2025-09:18:13] [TRT-LLM] [I] Benchmark complete.
[07/23/2025-09:18:13] [TRT-LLM] [I] Stopping LLM backend.
[07/23/2025-09:18:13] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[07/23/2025-09:18:13] [TRT-LLM] [I] All tasks cancelled.
[07/23/2025-09:18:13] [TRT-LLM] [I] LLM Backend stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Worker task cancelled.
[07/23/2025-09:18:13] [TRT-LLM] [I] Benchmark done. Reporting results...
[07/23/2025-09:18:13] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:18:13] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:18:13] [TRT-LLM] [I]

===========================================================
= PYTORCH BACKEND
===========================================================
Model:			nvidia/Llama-3.1-8B-Instruct-FP8
Model Path:		None
TensorRT-LLM Version:	0.20.0
Dtype:			bfloat16
KV Cache Dtype:		FP8
Quantization:		FP8

===========================================================
= REQUEST DETAILS
===========================================================
Number of requests:             30000
Number of concurrent requests:  14114.5246
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 28.0000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
EP Size:                None
Max Runtime Batch Size: 4096
Max Runtime Tokens:     23552
Scheduling Policy:      GUARANTEED_NO_EVICT
KV Memory Percentage:   80.00%
Issue Rate (req/sec):   3.4376E+15

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     253.5548
Total Output Throughput (tokens/sec):             7099.5345
Total Token Throughput (tokens/sec):              39554.5494
Total Latency (ms):                               118317.6163
Average request latency (ms):                     55666.5636
Per User Output Throughput [w/ ctx] (tps/user):   0.6702
Per GPU Output Throughput (tps/gpu):              7099.5345

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 54422.3568
[Latency] P90    : 94512.0713
[Latency] P95    : 98713.1150
[Latency] P99    : 102475.5310
[Latency] MINIMUM: 15187.1518
[Latency] MAXIMUM: 103317.3451
[Latency] AVERAGE: 55666.5636

===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl
Number of Sequences:  30000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000            28.0000           156.0000
MAX:   128.0000            28.0000           156.0000
AVG:   128.0000            28.0000           156.0000
P50:   128.0000            28.0000           156.0000
P90:   128.0000            28.0000           156.0000
P95:   128.0000            28.0000           156.0000
P99:   128.0000            28.0000           156.0000
===========================================================

[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_result_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_stats_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_kv_cache_events_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread await_response_thread stopped.
[07/23/2025-09:18:14] [TRT-LLM] [I] Thread dispatch_stats_thread stopped.
[07/23/2025-09:18:14] [TRT-LLM] [I] Thread dispatch_kv_cache_events_thread stopped.

I hit an out-of-memory error if I set kv_cache_free_gpu_mem_fraction to 0.9.
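
If it helps, I assume the runtime limits can also be capped explicitly instead of relying on the heuristics shown in the log; a rough sketch of what I could try (assuming trtllm-bench throughput accepts --max_batch_size / --max_num_tokens in 0.20):

# hypothetical variant of the command above with explicit runtime limits; the two extra flags are assumptions
trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 throughput --dataset /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl --backend pytorch --extra_llm_api_options extra-llm-api-config.yml --max_batch_size 2048 --max_num_tokens 8192 --kv_cache_free_gpu_mem_fraction 0.9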
