System Info
NVIDIA H100 80GB HBM3
| server_version | 2.52.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
Who can help?
@kaiyu
Reproduction
I have followed the official example from:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/README.html#prepare-the-model-repository
I have set up the inflight_batcher_llm model repository from here:
https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
and I am using the OpenAI frontend in front of it:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_guide/openai_readme.html
The model being served is:
https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8
Running the checkpoint conversion script:
python3 convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
                              --output_dir ${UNIFIED_CKPT_PATH} \
                              --dtype float16
Building the TensorRT-LLM engine:
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir ${ENGINE_PATH} \
             --kv_cache_type paged \
             --max_batch_size 1024
Creating the model repository based on the inflight_batcher_llm templates from here:
https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
Starting Triton with the OpenAI API frontend:
python3 /root/.cache/deps/triton_repo/server/python/openai/openai_frontend/main.py --model-repository $MODEL_REPO --tokenizer $TOKENIZER_PATH --tritonserver-log-verbose-level=1
Using genai-perf to send concurrent requests to the ensemble model:
genai-perf profile ...
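For reference, the same load pattern can be reproduced without genai-perf; a minimal sketch using the openai Python client is below. The port 9000, the model name "ensemble", the prompt, and the concurrency level are assumptions for this setup, not values taken from the genai-perf run.

# Minimal sketch: N concurrent chat completions against the OpenAI-compatible frontend.
# Port 9000, model name "ensemble", prompt and concurrency are assumptions; adjust to your deployment.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="ensemble",
        messages=[{"role": "user", "content": f"Write a short poem #{i}"}],
        max_tokens=256,
    )
    return len(resp.choices[0].message.content)

# Keep many requests in flight at once so the scheduler has a chance to batch them.
with ThreadPoolExecutor(max_workers=32) as pool:
    lengths = list(pool.map(one_request, range(64)))
print(f"completed {len(lengths)} requests")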
The config.pbtxt file for the tensorrt_llm model:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1024
model_transaction_policy {
decoupled: true
}
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_input_features"
data_type: TYPE_FP16
dims: [ -1, -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_output_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "num_return_sequences"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
reshape: { shape: [ ] }
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_acceptance_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_reset_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "early_stopping"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_kv_cache_reuse_stats"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "exclude_input_in_output"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_table_extra_ids"
data_type: TYPE_UINT64
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "cross_attention_mask"
data_type: TYPE_BOOL
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "skip_cross_attn_blocks"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "sequence_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
},
{
name: "kv_cache_alloc_new_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_reused_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_alloc_total_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind : KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "${max_beam_width}"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/root/.cache/deps/transformation/engines/llama_new1/1b/"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "${encoder_engine_dir}"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "${max_tokens_in_paged_kv_cache}"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "${max_attention_window_size}"
}
}
parameters: {
key: "sink_token_length"
value: {
string_value: "${sink_token_length}"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "${batch_scheduler_policy}"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "${kv_cache_free_gpu_mem_fraction}"
}
}
parameters: {
key: "cross_kv_cache_fraction"
value: {
string_value: "${cross_kv_cache_fraction}"
}
}
parameters: {
key: "kv_cache_host_memory_bytes"
value: {
string_value: "${kv_cache_host_memory_bytes}"
}
}
parameters: {
key: "kv_cache_onboard_blocks"
value: {
string_value: "${kv_cache_onboard_blocks}"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "${exclude_input_in_output}"
}
}
parameters: {
key: "cancellation_check_period_ms"
value: {
string_value: "${cancellation_check_period_ms}"
}
}
parameters: {
key: "stats_check_period_ms"
value: {
string_value: "${stats_check_period_ms}"
}
}
parameters: {
key: "iter_stats_max_iterations"
value: {
string_value: "${iter_stats_max_iterations}"
}
}
parameters: {
key: "request_stats_max_iterations"
value: {
string_value: "${request_stats_max_iterations}"
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "${enable_kv_cache_reuse}"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "${normalize_log_probs}"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "${enable_chunked_context}"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "${gpu_device_ids}"
}
}
parameters: {
key: "participant_ids"
value: {
string_value: "${participant_ids}"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
string_value: "${lora_cache_optimal_adapter_size}"
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
string_value: "${lora_cache_max_adapter_size}"
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
string_value: "${lora_cache_gpu_memory_fraction}"
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
string_value: "${lora_cache_host_memory_bytes}"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "${decoding_mode}"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
parameters: {
key: "medusa_choices"
value: {
string_value: "${medusa_choices}"
}
}
parameters: {
key: "eagle_choices"
value: {
string_value: "${eagle_choices}"
}
}
parameters: {
key: "gpu_weights_percent"
value: {
string_value: "${gpu_weights_percent}"
}
}
parameters: {
key: "enable_context_fmha_fp32_acc"
value: {
string_value: "${enable_context_fmha_fp32_acc}"
}
}
parameters: {
key: "multi_block_mode"
value: {
string_value: "${multi_block_mode}"
}
}
parameters: {
key: "cuda_graph_mode"
value: {
string_value: "${cuda_graph_mode}"
}
}
parameters: {
key: "cuda_graph_cache_size"
value: {
string_value: "${cuda_graph_cache_size}"
}
}
parameters: {
key: "speculative_decoding_fast_logits"
value: {
string_value: "${speculative_decoding_fast_logits}"
}
}
If you need the other configs (ensemble, preprocessing, etc.), let me know.
Expected behavior
I should see some batching happening, but after sending concurrent requests the metrics endpoint shows me:
nv_inference_request_success{model="ensemble",version="1"} 232
nv_inference_request_success{model="postprocessing",version="1"} 25181
nv_inference_request_success{model="preprocessing",version="1"} 232
nv_inference_request_success{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_success{model="tensorrt_llm",version="1"} 232
nv_inference_request_failure{model="ensemble",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="OTHER",version="1"} 0
nv_inference_count{model="ensemble",version="1"} 232
nv_inference_count{model="postprocessing",version="1"} 25181
nv_inference_count{model="preprocessing",version="1"} 232
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="tensorrt_llm",version="1"} 232
nv_inference_exec_count{model="ensemble",version="1"} 232
nv_inference_exec_count{model="postprocessing",version="1"} 11041
nv_inference_exec_count{model="preprocessing",version="1"} 232
nv_inference_exec_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_exec_count{model="tensorrt_llm",version="1"} 232
nv_inference_request_duration_us{model="ensemble",version="1"} 369358274
nv_inference_request_duration_us{model="postprocessing",version="1"} 63622688
nv_inference_request_duration_us{model="preprocessing",version="1"} 254386
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 368632046
nv_inference_queue_duration_us{model="ensemble",version="1"} 1293
nv_inference_queue_duration_us{model="postprocessing",version="1"} 20113756
nv_inference_queue_duration_us{model="preprocessing",version="1"} 25294
nv_inference_queue_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 173117398
nv_inference_compute_input_duration_us{model="ensemble",version="1"} 3036172
nv_inference_compute_input_duration_us{model="postprocessing",version="1"} 999827
nv_inference_compute_input_duration_us{model="preprocessing",version="1"} 10749
nv_inference_compute_input_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_input_duration_us{model="tensorrt_llm",version="1"} 2012983
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 197515178
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 3992872
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 121548
nv_inference_compute_infer_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_infer_duration_us{model="tensorrt_llm",version="1"} 193387997
and this
nv_cpu_memory_used_bytes 180916031488
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="context",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="max",version="1"} 640
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 0
nv_trt_llm_runtime_memory_metrics{memory_type="pinned",model="tensorrt_llm",version="1"} 3556783876
nv_trt_llm_runtime_memory_metrics{memory_type="gpu",model="tensorrt_llm",version="1"} 61362490955
nv_trt_llm_runtime_memory_metrics{memory_type="cpu",model="tensorrt_llm",version="1"} 21532
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="reused",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="paused_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
nv_trt_llm_general_metrics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 4785
nv_trt_llm_general_metrics{general_type="timestamp",model="tensorrt_llm",version="1"} 1734880624970454
Actual behavior
I expected
nv_inference_exec_count{model="ensemble",version="1"}
to be smaller than
nv_inference_request_success{model="ensemble",version="1"}
but both are 232, i.e. every ensemble request appears to have been executed on its own and no batching happened.
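To make the comparison concrete, here is a small sketch that scrapes the metrics endpoint and computes nv_inference_count / nv_inference_exec_count per model; the metrics URL is an assumption, point it at whatever endpoint produced the dump above. A ratio of 1.00 (as for ensemble, preprocessing and tensorrt_llm above) means each Triton-level execution handled exactly one request; postprocessing comes out at about 2.28. In-flight batching inside the TRT-LLM executor would not necessarily show up in this Triton-level ratio.

# Sketch: average requests per Triton-level execution, per model, from the
# Prometheus metrics shown above. METRICS_URL is an assumption; use whatever
# endpoint produced the dump in this report.
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumption, adjust to your setup

text = urllib.request.urlopen(METRICS_URL).read().decode()

def per_model(metric: str) -> dict[str, float]:
    # Map model name -> value for one metric family, e.g. nv_inference_count.
    pattern = rf'{metric}\{{model="([^"]+)",version="[^"]+"\}} ([0-9.e+]+)'
    return {model: float(value) for model, value in re.findall(pattern, text)}

counts = per_model("nv_inference_count")
execs = per_model("nv_inference_exec_count")
for model, count in counts.items():
    if execs.get(model):
        print(f"{model}: {count / execs[model]:.2f} requests per execution")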
Additional notes
Some Triton logs after server start:
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 864.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.18 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.51 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 57.33 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 6605
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 51.60 GiB for max tokens in paged KV cache (422720).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
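As a sanity check, the KV cache numbers in this log are self-consistent and match an FP16 cache for an 8B Llama-style model; the geometry used below (32 layers, 8 KV heads, head dim 128) is an assumption about the engine, not something read from the logs:

# Sketch: back-of-the-envelope check of the startup-log KV cache numbers.
# The model geometry (32 layers, 8 KV heads, head_dim 128, FP16 cache) is an
# assumption for Llama-3.1-8B, not read from the engine itself.
blocks = 6605                # "Number of blocks in KV cache primary pool: 6605"
tokens_per_block = 64        # "Number of tokens per block: 64"
total_tokens = blocks * tokens_per_block
print(total_tokens)          # 422720, matches "max tokens in paged KV cache (422720)"

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2           # FP16
bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_per_elem  # factor 2 for K and V
print(bytes_per_token)       # 131072 bytes = 128 KiB per token

print(f"{total_tokens * bytes_per_token / 2**30:.2f} GiB")  # ~51.60 GiB, matches the allocated paged KV cache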
Some Triton logs after server stop:
Shutting down Triton OpenAI-Compatible Frontend...
Shutting down Triton Inference Server...
I1222 18:24:43.350231 2239 server.cc:305] "Waiting for in-flight requests to complete."
I1222 18:24:43.350291 2239 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I1222 18:24:43.351900 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for postprocessing..."
I1222 18:24:43.351925 2239 server.cc:336] "All models are stopped, unloading models"
I1222 18:24:43.351948 2239 server.cc:345] "Timeout 30: Found 5 live models and 0 in-flight non-inference requests"
I1222 18:24:43.351952 2239 server.cc:351] "ensemble v1: UNLOADING"
I1222 18:24:43.351959 2239 server.cc:351] "postprocessing v1: UNLOADING"
I1222 18:24:43.351961 2239 server.cc:351] "preprocessing v1: UNLOADING"
I1222 18:24:43.351964 2239 server.cc:351] "tensorrt_llm v1: UNLOADING"
I1222 18:24:43.351967 2239 server.cc:351] "tensorrt_llm_bls v1: UNLOADING"
I1222 18:24:43.352050 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for tensorrt_llm..."
I1222 18:24:43.352085 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_bls_0_0..."
I1222 18:24:43.352109 2239 backend_model_instance.cc:807] "Stopping backend thread for postprocessing_0_1..."
I1222 18:24:43.352157 2239 model_lifecycle.cc:636] "successfully unloaded 'ensemble' version 1"
I1222 18:24:43.352168 2239 backend_model_instance.cc:807] "Stopping backend thread for preprocessing_0_2..."
I1222 18:24:43.352270 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352367 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352394 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352402 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_0_0..."
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[TensorRT-LLM][INFO] Leader recvReq thread exiting
[TensorRT-LLM][INFO] Leader sendThread exiting
I1222 18:24:43.524265 2239 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm' version 1"
[TensorRT-LLM][INFO] Refreshed the MPI local session
Let me know if you need any more information. I would like to understand why, by default, the
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
block is not set by the scripts in the example, and whether I need to set it for in-flight batching.