
Commit 6e48ac2

chore: remove cuda_graph_ prefix from cuda_graph_config field members. (#5585)
Signed-off-by: nv-guomingz <[email protected]>
1 parent 16fc993 commit 6e48ac2
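
This change drops the redundant `cuda_graph_` prefix from the members of `cuda_graph_config`: `cuda_graph_padding_enabled` becomes `padding_enabled`, `cuda_graph_batch_sizes` becomes `batch_sizes`, and `cuda_graph_max_batch_size` becomes `max_batch_size`. The updated docs and example configs also stop setting a standalone `use_cuda_graph` flag and instead enable CUDA graphs by providing the `cuda_graph_config` block. As a quick orientation before the per-file diffs, here is a minimal sketch of the new layout for an `extra-llm-api-config.yml`; the batch sizes and surrounding keys are illustrative values taken from the updated docs below.

```bash
cat >./extra-llm-api-config.yml <<EOF
cuda_graph_config:
  padding_enabled: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128, 256, 384]
print_iter_log: true
enable_attention_dp: true
EOF
```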

File tree

16 files changed, +193 -210 lines changed


docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 27 additions & 27 deletions
@@ -196,20 +196,20 @@ We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
 pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 896
-  - 512
-  - 256
-  - 128
-  - 64
-  - 32
-  - 16
-  - 8
-  - 4
-  - 2
-  - 1
+  cuda_graph_config:
+    padding_enabled: true
+    batch_sizes:
+    - 896
+    - 512
+    - 256
+    - 128
+    - 64
+    - 32
+    - 16
+    - 8
+    - 4
+    - 2
+    - 1
   print_iter_log: true
   kv_cache_dtype: fp8
   enable_attention_dp: true
@@ -264,19 +264,19 @@ YOUR_DATA_PATH=./dataset.txt
 
 cat >./extra-llm-api-config.yml <<EOF
 pytorch_backend_config:
-  use_cuda_graph: true
-  cuda_graph_padding_enabled: true
-  cuda_graph_batch_sizes:
-  - 1
-  - 2
-  - 4
-  - 8
-  - 16
-  - 32
-  - 64
-  - 128
-  - 256
-  - 384
+  cuda_graph_config:
+    padding_enabled: true
+    batch_sizes:
+    - 1
+    - 2
+    - 4
+    - 8
+    - 16
+    - 32
+    - 64
+    - 128
+    - 256
+    - 384
   print_iter_log: ${PRINT_ITER_LOG}
   enable_attention_dp: true
 EOF

docs/source/performance/perf-overview.md

Lines changed: 18 additions & 18 deletions
@@ -200,24 +200,24 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py
 
 `llm_options.yml`
 ```yaml
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
-- 512
-- 1024
-- 2048
-- 4096
-- 8192
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
+  - 512
+  - 1024
+  - 2048
+  - 4096
+  - 8192
 ```
 
 In majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
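
For context, the sketch below shows how that fraction is passed to the `trtllm-bench` throughput run referenced in the hunk header above. The `$model_name`/`$dataset_file` variables come from that command; the `--backend pytorch` value and the `--extra_llm_api_options` flag pointing at `llm_options.yml` are assumptions here, not part of this diff.

```bash
# Sketch only: raise the free KV cache fraction from the default 0.90 to 0.95,
# falling back to 0.90 if the run hits an out-of-memory error.
trtllm-bench --model $model_name throughput \
  --dataset $dataset_file \
  --backend pytorch \
  --extra_llm_api_options llm_options.yml \
  --kv_cache_free_gpu_mem_fraction 0.95
```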

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 0 additions & 2 deletions
@@ -74,8 +74,6 @@ srun -l \
 
 # This is optional
 cat > /tmp/pytorch_extra_args.txt << EOF
-use_cuda_graph: false
-cuda_graph_padding_enabled: false
 print_iter_log: true
 enable_attention_dp: false
 EOF

examples/models/core/deepseek_v3/README.md

Lines changed: 45 additions & 45 deletions
@@ -141,9 +141,9 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 --num-requests 24 > /tmp/benchmarking_64k.txt
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes: [1, 4, 8, 12]
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes: [1, 4, 8, 12]
 EOF
 
 trtllm-bench -m deepseek-ai/DeepSeek-R1 --model_path ${DS_R1_NVFP4_MODEL_PATH} throughput \
@@ -168,9 +168,9 @@ python /app/tensorrt_llm/benchmarks/cpp/prepare_dataset.py \
 --num-requests 4 > /tmp/benchmarking_128k.txt
 
 cat <<EOF > /tmp/extra-llm-api-config.yml
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes: [1, 2]
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes: [1, 2]
 moe_max_num_tokens: 16384
 EOF
 
@@ -236,19 +236,19 @@ To serve the model using `trtllm-serve`:
 
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
 print_iter_log: true
 enable_attention_dp: true
 EOF
@@ -315,19 +315,19 @@ And you can launch two generation servers on port 8002 and 8003 with:
 export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
 print_iter_log: true
 enable_attention_dp: true
 EOF
@@ -537,19 +537,19 @@ python3 /path/to/TensorRT-LLM/benchmarks/cpp/prepare_dataset.py \
 --input-mean=1024 --output-mean=2048 --input-stdev=0 --output-stdev=0 > /tmp/dataset.txt
 
 cat >/path/to/TensorRT-LLM/extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
 print_iter_log: true
 enable_attention_dp: true
 EOF

examples/models/core/qwen/README.md

Lines changed: 26 additions & 26 deletions
@@ -733,19 +733,19 @@ To serve the model using `trtllm-serve`:
 
 ```bash
 cat >./extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
 print_iter_log: true
 enable_attention_dp: true
 EOF
@@ -809,19 +809,19 @@ And you can launch two generation servers on port 8002 and 8003 with:
 export TRTLLM_USE_UCX_KVCACHE=1
 
 cat >./gen-extra-llm-api-config.yml <<EOF
-use_cuda_graph: true
-cuda_graph_padding_enabled: true
-cuda_graph_batch_sizes:
-- 1
-- 2
-- 4
-- 8
-- 16
-- 32
-- 64
-- 128
-- 256
-- 384
+cuda_graph_config:
+  padding_enabled: true
+  batch_sizes:
+  - 1
+  - 2
+  - 4
+  - 8
+  - 16
+  - 32
+  - 64
+  - 128
+  - 256
+  - 384
 print_iter_log: true
 enable_attention_dp: true
 EOF

examples/pytorch/quickstart_advanced.py

Lines changed: 2 additions & 2 deletions
@@ -187,8 +187,8 @@ def setup_llm(args):
         spec_config = None
 
     cuda_graph_config = CudaGraphConfig(
-        cuda_graph_batch_sizes=args.cuda_graph_batch_sizes,
-        cuda_graph_padding_enabled=args.cuda_graph_padding_enabled,
+        batch_sizes=args.cuda_graph_batch_sizes,
+        padding_enabled=args.cuda_graph_padding_enabled,
     ) if args.use_cuda_graph else None
     llm = LLM(
         model=args.model_dir,

tensorrt_llm/_torch/auto_deploy/transformations/transform.py

Lines changed: 1 addition & 4 deletions
@@ -194,10 +194,7 @@ def __call__(self, cm: CachedSequenceInterface) -> GraphModule:
 
         cm.info.set_generate_only_batch()
         compiler_kwargs = {
-            "cuda_graph_batch_sizes": self.ad_config.cuda_graph_config.cuda_graph_batch_sizes
-            if hasattr(self.ad_config, "cuda_graph_config")
-            and self.ad_config.cuda_graph_config is not None
-            else None,
+            "cuda_graph_batch_sizes": self.ad_config.cuda_graph_batch_sizes,
             "num_batched_inputs": 2,  # TODO (lucaslie): improve once we have a config system...
         }
         egm_compiled = compile_and_capture(

tensorrt_llm/bench/benchmark/utils/general.py

Lines changed: 2 additions & 2 deletions
@@ -149,9 +149,9 @@ def get_settings(params: dict, dataset_metadata: DatasetMetadata, model: str,
 
     pyt_options = {
         "cuda_graph_config": {
-            "cuda_graph_padding_enabled":
+            "padding_enabled":
             True,
-            "cuda_graph_max_batch_size":
+            "max_batch_size":
             max_batch_size if cuda_graph_batch_sizes is None else 0,
         },
         "kv_cache_dtype": kv_cache_dtype,
