
[Performance issue] Unable to Reproduce the Throughput of Llama 3.1 8B FP8 on H100 #6294

@skymizerFuji

Description


Hi, I attempted to reproduce the Total Output Throughput (tokens/sec) reported here:
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.20.0/docs/source/performance/perf-overview.md#llama-31-8b-fp8

For ISL/OSL 128/128, the expected throughput is 27,688.36 tokens/sec, but I only achieved 7,099.53 tokens/sec.

Am I missing something?
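
For reference, the perf-overview guide generates the ISL/OSL 128/128 dataset with prepare_dataset.py roughly as follows (this is my reading of the guide; the script path and flag names may differ between releases):

# dataset generation as described in the perf-overview guide (flags may vary by version)
python benchmarks/cpp/prepare_dataset.py --tokenizer nvidia/Llama-3.1-8B-Instruct-FP8 --stdout token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 30000 > 128_128_30000.jsonl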

Environment:
Container launched on Runpod
1 x H100 PCIe (80 GB VRAM)
16 vCPU, 251 GB RAM
tensorrt_llm==0.20.0

extra-llm-api-config.yml:

cuda_graph_config:
  batch_sizes:
  - 128
enable_attention_dp: true

How I installed tensorrt_llm:

apt-get -y install libopenmpi-dev
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm
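
For reproducibility, I believe pinning the exact release is equivalent, e.g.:

# same install, but with the version pinned to match the perf-overview page above
pip3 install --upgrade pip setuptools && pip3 install tensorrt_llm==0.20.0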

Full output log:

root@7f4c828c65a8:/workspace/trtllm-benchmark# trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 throughput --dataset /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl --backend pytorch --extra_llm_api_options extra-llm-api-config.yml --kv_cache_free_gpu_mem_fraction 0.8
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-07-23 09:13:59,432 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[07/23/2025-09:14:01] [TRT-LLM] [I] Preparing to run throughput benchmark...
Parse safetensors files: 100%|█████████████████████████████████████| 2/2 [00:04<00:00,  2.26s/it]
[07/23/2025-09:14:15] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl
Number of Sequences:  30000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000            28.0000           156.0000
MAX:   128.0000            28.0000           156.0000
AVG:   128.0000            28.0000           156.0000
P50:   128.0000            28.0000           156.0000
P90:   128.0000            28.0000           156.0000
P95:   128.0000            28.0000           156.0000
P99:   128.0000            28.0000           156.0000
===========================================================

Fetching 12 files: 100%|████████████████████████████████████████| 12/12 [00:00<00:00, 208.34it/s]
Parse safetensors files: 100%|█████████████████████████████████████| 2/2 [00:00<00:00,  8.31it/s]
[07/23/2025-09:14:16] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:14:16] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated engine size: 7.48 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated total available memory for KV cache: 72.17 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated total KV cache memory: 68.56 GB
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 7200.59
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max batch size (after fine-tune): 4096
[07/23/2025-09:14:16] [TRT-LLM] [I] Estimated max num tokens (after fine-tune): 23552
[07/23/2025-09:14:16] [TRT-LLM] [I] Max batch size and max num tokens not provided. Using heuristics or pre-defined settings: max_batch_size=4096, max_num_tokens=23552.
[07/23/2025-09:14:16] [TRT-LLM] [I] Setting PyTorch max sequence length to 156
[07/23/2025-09:14:16] [TRT-LLM] [I] Setting up throughput benchmark.
[07/23/2025-09:14:16] [TRT-LLM] [W] Using default gpus_per_node: 1
[07/23/2025-09:14:16] [TRT-LLM] [I] Set nccl_plugin to None.
[07/23/2025-09:14:17] [TRT-LLM] [I] PyTorchConfig(extra_resource_managers={}, use_cuda_graph=True, cuda_graph_batch_sizes=[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 256, 512, 1024, 2048, 4096], cuda_graph_max_batch_size=4096, cuda_graph_padding_enabled=True, disable_overlap_scheduler=False, moe_max_num_tokens=None, attn_backend='TRTLLM', moe_backend='CUTLASS', mixed_sampler=False, enable_trtllm_sampler=False, kv_cache_dtype='auto', use_kv_cache=True, enable_iter_perf_stats=False, enable_iter_req_stats=False, print_iter_log=False, torch_compile_enabled=False, torch_compile_fullgraph=True, torch_compile_inductor_enabled=False, torch_compile_piecewise_cuda_graph=False, torch_compile_enable_userbuffers=True, autotuner_enabled=True, enable_layerwise_nvtx_marker=False, load_format=<LoadFormat.AUTO: 0>)
rank 0 using MpiPoolSession to spawn MPI processes
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[07/23/2025-09:14:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1184: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
2025-07-23 09:14:27,490 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM][INFO] Refreshed the MPI local session
[07/23/2025-09:14:30] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:14:30] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 uses 8.46 GB for model weights.
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors to memory...
[07/23/2025-09:14:31] [TRT-LLM] [I] Rank 0 prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors to memory...
[07/23/2025-09:15:43] [TRT-LLM] [I] Rank 0 finished prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors.
[07/23/2025-09:15:56] [TRT-LLM] [I] Rank 0 finished prefetching /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors.
[07/23/2025-09:15:56] [TRT-LLM] [I] Loading /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00001-of-00002.safetensors
[07/23/2025-09:15:56] [TRT-LLM] [I] Loading /workspace/.huggingface/hub/models--nvidia--Llama-3.1-8B-Instruct-FP8/snapshots/026c7c29fdd02d53f17c125d2dec8cb1a2251c23/model-00002-of-00002.safetensors
Loading weights: 100%|██████████| 617/617 [00:02<00:00, 256.10it/s]
Model init total -- 88.37s
[07/23/2025-09:15:59] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.800000011920929 and 28384, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5 [window size=157]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.73 GiB for max tokens in paged KV cache (28384).
[07/23/2025-09:15:59] [TRT-LLM] [I] max_seq_len=157, max_num_requests=4096, max_num_tokens=23552
[07/23/2025-09:15:59] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[07/23/2025-09:15:59] [TRT-LLM] [I] Run autotuning warmup for batch size=1
2025-07-23 09:15:59,759 - INFO - flashinfer.jit: Loading JIT ops: norm
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-07-23 09:15:59,903 - INFO - flashinfer.jit: Finished loading JIT ops: norm
2025-07-23 09:16:00,896 - INFO - flashinfer.jit: Loading JIT ops: silu_and_mul
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-07-23 09:16:01,321 - INFO - flashinfer.jit: Finished loading JIT ops: silu_and_mul
[07/23/2025-09:16:01] [TRT-LLM] [I] Autotuner Cache size after warmup 0
[07/23/2025-09:16:01] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[07/23/2025-09:16:01] [TRT-LLM] [I] Creating CUDA graph instances for 24 batch sizes.
[07/23/2025-09:16:02] [TRT-LLM] [I] Memory used after loading model weights (inside torch) in memory usage profiling: 8.55 GiB
[07/23/2025-09:16:02] [TRT-LLM] [I] Memory used after loading model weights (outside torch) in memory usage profiling: 3.14 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Memory dynamically allocated during inference (inside torch) in memory usage profiling: 2.95 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Memory used outside torch (e.g., NCCL and CUDA graphs) in memory usage profiling: 3.78 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Peak memory during memory usage profiling (torch + non-torch): 15.29 GiB, available KV cache memory when calculating max tokens: 52.51 GiB, fraction is set 0.800000011920929, kv size is 65536, device total memory 79.19 GiB, , tmp kv_mem 1.73 GiB
[07/23/2025-09:16:03] [TRT-LLM] [I] Estimated max tokens in KV cache : 860308
[07/23/2025-09:16:03] [TRT-LLM] [W] Both free_gpu_memory_fraction and max_tokens are set (to 0.800000011920929 and 860308, respectively). The smaller value will be used.
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 5 [window size=157]
[TensorRT-LLM][INFO] Number of tokens per block: 32.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 52.51 GiB for max tokens in paged KV cache (860320).
[07/23/2025-09:16:03] [TRT-LLM] [I] max_seq_len=157, max_num_requests=4096, max_num_tokens=23552
[07/23/2025-09:16:03] [TRT-LLM] [I] [Autotuner]: Autotuning process starts ...
[07/23/2025-09:16:03] [TRT-LLM] [I] Run autotuning warmup for batch size=1
[07/23/2025-09:16:03] [TRT-LLM] [I] Autotuner Cache size after warmup 0
[07/23/2025-09:16:03] [TRT-LLM] [I] [Autotuner]: Autotuning process ends
[07/23/2025-09:16:03] [TRT-LLM] [I] Creating CUDA graph instances for 24 batch sizes.
[07/23/2025-09:16:04] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=4096
[07/23/2025-09:16:05] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=2048
[07/23/2025-09:16:05] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=1024
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=512
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=256
[07/23/2025-09:16:06] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=128
[07/23/2025-09:16:07] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=120
[07/23/2025-09:16:07] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=112
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=104
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=96
[07/23/2025-09:16:08] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=88
[07/23/2025-09:16:09] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=80
[07/23/2025-09:16:09] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=72
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=64
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=56
[07/23/2025-09:16:10] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=48
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=40
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=32
[07/23/2025-09:16:11] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=24
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=16
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=8
[07/23/2025-09:16:12] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=4
[07/23/2025-09:16:13] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=2
[07/23/2025-09:16:13] [TRT-LLM] [I] Run generation only CUDA graph warmup for batch size=1
[07/23/2025-09:16:13] [TRT-LLM] [I] Setting up for warmup...
[07/23/2025-09:16:13] [TRT-LLM] [I] Running warmup.
[07/23/2025-09:16:13] [TRT-LLM] [I] Starting benchmarking async task.
[07/23/2025-09:16:13] [TRT-LLM] [I] Starting benchmark...
[07/23/2025-09:16:13] [TRT-LLM] [I] Request submission complete. [count=2, time=0.0001s, rate=30122.30 req/s]
[07/23/2025-09:16:14] [TRT-LLM] [I] Benchmark complete.
[07/23/2025-09:16:14] [TRT-LLM] [I] Stopping LLM backend.
[07/23/2025-09:16:14] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[07/23/2025-09:16:14] [TRT-LLM] [I] All tasks cancelled.
[07/23/2025-09:16:14] [TRT-LLM] [I] LLM Backend stopped.
[07/23/2025-09:16:14] [TRT-LLM] [I] Worker task cancelled.
[07/23/2025-09:16:14] [TRT-LLM] [I] Warmup done.
[07/23/2025-09:16:14] [TRT-LLM] [I] No log path provided, skipping logging.
[07/23/2025-09:16:14] [TRT-LLM] [I] Starting benchmarking async task.
[07/23/2025-09:16:14] [TRT-LLM] [I] Starting benchmark...
[07/23/2025-09:16:14] [TRT-LLM] [I] Request submission complete. [count=30000, time=0.1026s, rate=292499.08 req/s]
[07/23/2025-09:18:13] [TRT-LLM] [I] Benchmark complete.
[07/23/2025-09:18:13] [TRT-LLM] [I] Stopping LLM backend.
[07/23/2025-09:18:13] [TRT-LLM] [I] Cancelling all 0 tasks to complete.
[07/23/2025-09:18:13] [TRT-LLM] [I] All tasks cancelled.
[07/23/2025-09:18:13] [TRT-LLM] [I] LLM Backend stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Worker task cancelled.
[07/23/2025-09:18:13] [TRT-LLM] [I] Benchmark done. Reporting results...
[07/23/2025-09:18:13] [TRT-LLM] [I] Validating KV Cache config against kv_cache_dtype="auto"
[07/23/2025-09:18:13] [TRT-LLM] [I] KV cache quantization set to "auto". Using checkpoint KV quantization.
[07/23/2025-09:18:13] [TRT-LLM] [I]

===========================================================
= PYTORCH BACKEND
===========================================================
Model:			nvidia/Llama-3.1-8B-Instruct-FP8
Model Path:		None
TensorRT-LLM Version:	0.20.0
Dtype:			bfloat16
KV Cache Dtype:		FP8
Quantization:		FP8

===========================================================
= REQUEST DETAILS
===========================================================
Number of requests:             30000
Number of concurrent requests:  14114.5246
Average Input Length (tokens):  128.0000
Average Output Length (tokens): 28.0000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
EP Size:                None
Max Runtime Batch Size: 4096
Max Runtime Tokens:     23552
Scheduling Policy:      GUARANTEED_NO_EVICT
KV Memory Percentage:   80.00%
Issue Rate (req/sec):   3.4376E+15

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     253.5548
Total Output Throughput (tokens/sec):             7099.5345
Total Token Throughput (tokens/sec):              39554.5494
Total Latency (ms):                               118317.6163
Average request latency (ms):                     55666.5636
Per User Output Throughput [w/ ctx] (tps/user):   0.6702
Per GPU Output Throughput (tps/gpu):              7099.5345

-- Request Latency Breakdown (ms) -----------------------

[Latency] P50    : 54422.3568
[Latency] P90    : 94512.0713
[Latency] P95    : 98713.1150
[Latency] P99    : 102475.5310
[Latency] MINIMUM: 15187.1518
[Latency] MAXIMUM: 103317.3451
[Latency] AVERAGE: 55666.5636

===========================================================
= DATASET DETAILS
===========================================================
Dataset Path:         /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl
Number of Sequences:  30000

-- Percentiles statistics ---------------------------------

        Input              Output           Seq. Length
-----------------------------------------------------------
MIN:   128.0000            28.0000           156.0000
MAX:   128.0000            28.0000           156.0000
AVG:   128.0000            28.0000           156.0000
P50:   128.0000            28.0000           156.0000
P90:   128.0000            28.0000           156.0000
P95:   128.0000            28.0000           156.0000
P99:   128.0000            28.0000           156.0000
===========================================================

[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_result_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_stats_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread proxy_dispatch_kv_cache_events_thread stopped.
[07/23/2025-09:18:13] [TRT-LLM] [I] Thread await_response_thread stopped.
[07/23/2025-09:18:14] [TRT-LLM] [I] Thread dispatch_stats_thread stopped.
[07/23/2025-09:18:14] [TRT-LLM] [I] Thread dispatch_kv_cache_events_thread stopped.

I hit an out-of-memory error if I set kv_cache_free_gpu_mem_fraction to 0.9.
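
If it helps, I assume the runtime limits can also be capped explicitly instead of relying on the heuristics shown in the log; a rough sketch of what I could try (assuming trtllm-bench throughput accepts --max_batch_size / --max_num_tokens in 0.20):

# hypothetical variant of the command above with explicit runtime limits; the two extra flags are assumptions
trtllm-bench --model nvidia/Llama-3.1-8B-Instruct-FP8 throughput --dataset /workspace/datasets/nvidia/Llama-3.1-8B-Instruct-FP8/128_128_30000.jsonl --backend pytorch --extra_llm_api_options extra-llm-api-config.yml --max_batch_size 2048 --max_num_tokens 8192 --kv_cache_free_gpu_mem_fraction 0.9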
