System Info
NVIDIA H100 80GB HBM3
| server_version | 2.52.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
Who can help?
@kaiyu
Reproduction
I have followed the official example from:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/tensorrtllm_backend/README.html#prepare-the-model-repository
I have set up the inflight_batcher_llm model repository from here:
https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
and I am using the OpenAI frontend in front of it:
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client_guide/openai_readme.html
The model being served is:
https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8
Running the checkpoint conversion script:
python3 convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
                              --output_dir ${UNIFIED_CKPT_PATH} \
                              --dtype float16
Building the TensorRT-LLM engine:
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
             --remove_input_padding enable \
             --gpt_attention_plugin float16 \
             --context_fmha enable \
             --gemm_plugin float16 \
             --output_dir ${ENGINE_PATH} \
             --kv_cache_type paged \
             --max_batch_size 1024
Creating the model repository based on the inflight_batcher_llm templates from here:
https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm
Starting Triton with the OpenAI API frontend:
python3 /root/.cache/deps/triton_repo/server/python/openai/openai_frontend/main.py --model-repository $MODEL_REPO --tokenizer $TOKENIZER_PATH --tritonserver-log-verbose-level=1
Using genai-perf to send concurrent requests to the ensemble model:
genai-perf profile ...
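For reference, the same load pattern can be reproduced without genai-perf; a minimal sketch using the openai Python client is below. The port 9000, the model name "ensemble", the prompt, and the concurrency level are assumptions for this setup, not values taken from the genai-perf run.

# Minimal sketch: N concurrent chat completions against the OpenAI-compatible frontend.
# Port 9000, model name "ensemble", prompt and concurrency are assumptions; adjust to your deployment.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

def one_request(i: int) -> int:
    resp = client.chat.completions.create(
        model="ensemble",
        messages=[{"role": "user", "content": f"Write a short poem #{i}"}],
        max_tokens=256,
    )
    return len(resp.choices[0].message.content)

# Keep many requests in flight at once so the scheduler has a chance to batch them.
with ThreadPoolExecutor(max_workers=32) as pool:
    lengths = list(pool.map(one_request, range(64)))
print(f"completed {len(lengths)} requests")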
The config.pbtxt file for the tensorrt_llm model:
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 1024
model_transaction_policy {
decoupled: true
}
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
input [
{
name: "input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_input_features"
data_type: TYPE_FP16
dims: [ -1, -1 ]
allow_ragged_batch: true
optional: true
},
{
name: "encoder_output_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "num_return_sequences"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "draft_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_ids"
data_type: TYPE_INT32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "decoder_input_lengths"
data_type: TYPE_INT32
dims: [ 1 ]
optional: true
reshape: { shape: [ ] }
},
{
name: "draft_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "draft_acceptance_threshold"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "pad_id"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "embedding_bias"
data_type: TYPE_FP32
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "beam_width"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_k"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p_reset_ids"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "early_stopping"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "min_length"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "presence_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "frequency_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_context_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_generation_logits"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "return_kv_cache_reuse_stats"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "exclude_input_in_output"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "stop"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "streaming"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "prompt_embedding_table"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_table_extra_ids"
data_type: TYPE_UINT64
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "prompt_vocab_size"
data_type: TYPE_INT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "cross_attention_mask"
data_type: TYPE_BOOL
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_task_id"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "lora_weights"
data_type: TYPE_FP16
dims: [ -1, -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "lora_config"
data_type: TYPE_INT32
dims: [ -1, 3 ]
optional: true
allow_ragged_batch: true
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
optional: true
allow_ragged_batch: true
},
{
name: "skip_cross_attn_blocks"
data_type: TYPE_BOOL
dims: [ 1 ]
optional: true
allow_ragged_batch: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_INT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_INT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "context_logits"
data_type: TYPE_FP32
dims: [ -1, -1 ]
},
{
name: "generation_logits"
data_type: TYPE_FP32
dims: [ -1, -1, -1 ]
},
{
name: "batch_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "sequence_index"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "context_phase_params"
data_type: TYPE_UINT8
dims: [ -1 ]
},
{
name: "kv_cache_alloc_new_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_reused_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
},
{
name: "kv_cache_alloc_total_blocks"
data_type: TYPE_INT32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind : KIND_CPU
}
]
parameters: {
key: "max_beam_width"
value: {
string_value: "${max_beam_width}"
}
}
parameters: {
key: "FORCE_CPU_ONLY_INPUT_TENSORS"
value: {
string_value: "no"
}
}
parameters: {
key: "gpt_model_type"
value: {
string_value: "inflight_fused_batching"
}
}
parameters: {
key: "gpt_model_path"
value: {
string_value: "/root/.cache/deps/transformation/engines/llama_new1/1b/"
}
}
parameters: {
key: "encoder_model_path"
value: {
string_value: "${encoder_engine_dir}"
}
}
parameters: {
key: "max_tokens_in_paged_kv_cache"
value: {
string_value: "${max_tokens_in_paged_kv_cache}"
}
}
parameters: {
key: "max_attention_window_size"
value: {
string_value: "${max_attention_window_size}"
}
}
parameters: {
key: "sink_token_length"
value: {
string_value: "${sink_token_length}"
}
}
parameters: {
key: "batch_scheduler_policy"
value: {
string_value: "${batch_scheduler_policy}"
}
}
parameters: {
key: "kv_cache_free_gpu_mem_fraction"
value: {
string_value: "${kv_cache_free_gpu_mem_fraction}"
}
}
parameters: {
key: "cross_kv_cache_fraction"
value: {
string_value: "${cross_kv_cache_fraction}"
}
}
parameters: {
key: "kv_cache_host_memory_bytes"
value: {
string_value: "${kv_cache_host_memory_bytes}"
}
}
parameters: {
key: "kv_cache_onboard_blocks"
value: {
string_value: "${kv_cache_onboard_blocks}"
}
}
parameters: {
key: "exclude_input_in_output"
value: {
string_value: "${exclude_input_in_output}"
}
}
parameters: {
key: "cancellation_check_period_ms"
value: {
string_value: "${cancellation_check_period_ms}"
}
}
parameters: {
key: "stats_check_period_ms"
value: {
string_value: "${stats_check_period_ms}"
}
}
parameters: {
key: "iter_stats_max_iterations"
value: {
string_value: "${iter_stats_max_iterations}"
}
}
parameters: {
key: "request_stats_max_iterations"
value: {
string_value: "${request_stats_max_iterations}"
}
}
parameters: {
key: "enable_kv_cache_reuse"
value: {
string_value: "${enable_kv_cache_reuse}"
}
}
parameters: {
key: "normalize_log_probs"
value: {
string_value: "${normalize_log_probs}"
}
}
parameters: {
key: "enable_chunked_context"
value: {
string_value: "${enable_chunked_context}"
}
}
parameters: {
key: "gpu_device_ids"
value: {
string_value: "${gpu_device_ids}"
}
}
parameters: {
key: "participant_ids"
value: {
string_value: "${participant_ids}"
}
}
parameters: {
key: "lora_cache_optimal_adapter_size"
value: {
string_value: "${lora_cache_optimal_adapter_size}"
}
}
parameters: {
key: "lora_cache_max_adapter_size"
value: {
string_value: "${lora_cache_max_adapter_size}"
}
}
parameters: {
key: "lora_cache_gpu_memory_fraction"
value: {
string_value: "${lora_cache_gpu_memory_fraction}"
}
}
parameters: {
key: "lora_cache_host_memory_bytes"
value: {
string_value: "${lora_cache_host_memory_bytes}"
}
}
parameters: {
key: "decoding_mode"
value: {
string_value: "${decoding_mode}"
}
}
parameters: {
key: "executor_worker_path"
value: {
string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
}
}
parameters: {
key: "medusa_choices"
value: {
string_value: "${medusa_choices}"
}
}
parameters: {
key: "eagle_choices"
value: {
string_value: "${eagle_choices}"
}
}
parameters: {
key: "gpu_weights_percent"
value: {
string_value: "${gpu_weights_percent}"
}
}
parameters: {
key: "enable_context_fmha_fp32_acc"
value: {
string_value: "${enable_context_fmha_fp32_acc}"
}
}
parameters: {
key: "multi_block_mode"
value: {
string_value: "${multi_block_mode}"
}
}
parameters: {
key: "cuda_graph_mode"
value: {
string_value: "${cuda_graph_mode}"
}
}
parameters: {
key: "cuda_graph_cache_size"
value: {
string_value: "${cuda_graph_cache_size}"
}
}
parameters: {
key: "speculative_decoding_fast_logits"
value: {
string_value: "${speculative_decoding_fast_logits}"
}
}
If you need the other configs (ensemble, preprocessing, etc.), let me know.
Expected behavior
I should see some batching happening, but after sending concurrent requests the metrics endpoint shows me:
nv_inference_request_success{model="ensemble",version="1"} 232
nv_inference_request_success{model="postprocessing",version="1"} 25181
nv_inference_request_success{model="preprocessing",version="1"} 232
nv_inference_request_success{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_success{model="tensorrt_llm",version="1"} 232
nv_inference_request_failure{model="ensemble",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="ensemble",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="OTHER",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="tensorrt_llm_bls",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="CANCELED",version="1"} 0
nv_inference_request_failure{model="preprocessing",reason="REJECTED",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="BACKEND",version="1"} 0
nv_inference_request_failure{model="postprocessing",reason="OTHER",version="1"} 0
nv_inference_count{model="ensemble",version="1"} 232
nv_inference_count{model="postprocessing",version="1"} 25181
nv_inference_count{model="preprocessing",version="1"} 232
nv_inference_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_count{model="tensorrt_llm",version="1"} 232
nv_inference_exec_count{model="ensemble",version="1"} 232
nv_inference_exec_count{model="postprocessing",version="1"} 11041
nv_inference_exec_count{model="preprocessing",version="1"} 232
nv_inference_exec_count{model="tensorrt_llm_bls",version="1"} 0
nv_inference_exec_count{model="tensorrt_llm",version="1"} 232
nv_inference_request_duration_us{model="ensemble",version="1"} 369358274
nv_inference_request_duration_us{model="postprocessing",version="1"} 63622688
nv_inference_request_duration_us{model="preprocessing",version="1"} 254386
nv_inference_request_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_request_duration_us{model="tensorrt_llm",version="1"} 368632046
nv_inference_queue_duration_us{model="ensemble",version="1"} 1293
nv_inference_queue_duration_us{model="postprocessing",version="1"} 20113756
nv_inference_queue_duration_us{model="preprocessing",version="1"} 25294
nv_inference_queue_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_queue_duration_us{model="tensorrt_llm",version="1"} 173117398
nv_inference_compute_input_duration_us{model="ensemble",version="1"} 3036172
nv_inference_compute_input_duration_us{model="postprocessing",version="1"} 999827
nv_inference_compute_input_duration_us{model="preprocessing",version="1"} 10749
nv_inference_compute_input_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_input_duration_us{model="tensorrt_llm",version="1"} 2012983
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 197515178
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 3992872
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 121548
nv_inference_compute_infer_duration_us{model="tensorrt_llm_bls",version="1"} 0
nv_inference_compute_infer_duration_us{model="tensorrt_llm",version="1"} 193387997
and this
nv_cpu_memory_used_bytes 180916031488
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="context",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="scheduled",version="1"} 0
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="max",version="1"} 640
nv_trt_llm_request_metrics{model="tensorrt_llm",request_type="active",version="1"} 0
nv_trt_llm_runtime_memory_metrics{memory_type="pinned",model="tensorrt_llm",version="1"} 3556783876
nv_trt_llm_runtime_memory_metrics{memory_type="gpu",model="tensorrt_llm",version="1"} 61362490955
nv_trt_llm_runtime_memory_metrics{memory_type="cpu",model="tensorrt_llm",version="1"} 21532
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="reused",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="tokens_per",model="tensorrt_llm",version="1"} 64
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="used",model="tensorrt_llm",version="1"} 0
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="free",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type="max",model="tensorrt_llm",version="1"} 6605
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="paused_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="micro_batch_id",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="generation_requests",model="tensorrt_llm",version="1"} 0
nv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric="total_context_tokens",model="tensorrt_llm",version="1"} 0
nv_trt_llm_general_metrics{general_type="iteration_counter",model="tensorrt_llm",version="1"} 4785
nv_trt_llm_general_metrics{general_type="timestamp",model="tensorrt_llm",version="1"} 1734880624970454
Actual behavior
I expected
nv_inference_exec_count{model="ensemble",version="1"}
to be smaller than
nv_inference_request_success{model="ensemble",version="1"}
but both are 232, i.e. every ensemble request appears to have been executed on its own and no batching happened.
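To make the comparison concrete, here is a small sketch that scrapes the metrics endpoint and computes nv_inference_count / nv_inference_exec_count per model; the metrics URL is an assumption, point it at whatever endpoint produced the dump above. A ratio of 1.00 (as for ensemble, preprocessing and tensorrt_llm above) means each Triton-level execution handled exactly one request; postprocessing comes out at about 2.28. In-flight batching inside the TRT-LLM executor would not necessarily show up in this Triton-level ratio.

# Sketch: average requests per Triton-level execution, per model, from the
# Prometheus metrics shown above. METRICS_URL is an assumption; use whatever
# endpoint produced the dump in this report.
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumption, adjust to your setup

text = urllib.request.urlopen(METRICS_URL).read().decode()

def per_model(metric: str) -> dict[str, float]:
    # Map model name -> value for one metric family, e.g. nv_inference_count.
    pattern = rf'{metric}\{{model="([^"]+)",version="[^"]+"\}} ([0-9.e+]+)'
    return {model: float(value) for model, value in re.findall(pattern, text)}

counts = per_model("nv_inference_count")
execs = per_model("nv_inference_exec_count")
for model, count in counts.items():
    if execs.get(model):
        print(f"{model}: {count / execs[model]:.2f} requests per execution")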
Additional notes
Some Triton logs after server start:
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 864.01 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15380 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.18 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 2.51 GB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.11 GiB, available: 57.33 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 6605
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 2048
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 51.60 GiB for max tokens in paged KV cache (422720).
[TensorRT-LLM][INFO] Enable MPI KV cache transport.
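As a sanity check, the KV cache numbers in this log are self-consistent and match an FP16 cache for an 8B Llama-style model; the geometry used below (32 layers, 8 KV heads, head dim 128) is an assumption about the engine, not something read from the logs:

# Sketch: back-of-the-envelope check of the startup-log KV cache numbers.
# The model geometry (32 layers, 8 KV heads, head_dim 128, FP16 cache) is an
# assumption for Llama-3.1-8B, not read from the engine itself.
blocks = 6605                # "Number of blocks in KV cache primary pool: 6605"
tokens_per_block = 64        # "Number of tokens per block: 64"
total_tokens = blocks * tokens_per_block
print(total_tokens)          # 422720, matches "max tokens in paged KV cache (422720)"

layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2           # FP16
bytes_per_token = layers * kv_heads * head_dim * 2 * bytes_per_elem  # factor 2 for K and V
print(bytes_per_token)       # 131072 bytes = 128 KiB per token

print(f"{total_tokens * bytes_per_token / 2**30:.2f} GiB")  # ~51.60 GiB, matches the allocated paged KV cache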
Some Triton logs after server stop:
Shutting down Triton OpenAI-Compatible Frontend...
Shutting down Triton Inference Server...
I1222 18:24:43.350231 2239 server.cc:305] "Waiting for in-flight requests to complete."
I1222 18:24:43.350291 2239 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences"
I1222 18:24:43.351900 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for postprocessing..."
I1222 18:24:43.351925 2239 server.cc:336] "All models are stopped, unloading models"
I1222 18:24:43.351948 2239 server.cc:345] "Timeout 30: Found 5 live models and 0 in-flight non-inference requests"
I1222 18:24:43.351952 2239 server.cc:351] "ensemble v1: UNLOADING"
I1222 18:24:43.351959 2239 server.cc:351] "postprocessing v1: UNLOADING"
I1222 18:24:43.351961 2239 server.cc:351] "preprocessing v1: UNLOADING"
I1222 18:24:43.351964 2239 server.cc:351] "tensorrt_llm v1: UNLOADING"
I1222 18:24:43.351967 2239 server.cc:351] "tensorrt_llm_bls v1: UNLOADING"
I1222 18:24:43.352050 2239 dynamic_batch_scheduler.cc:445] "Stopping dynamic-batcher thread for tensorrt_llm..."
I1222 18:24:43.352085 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_bls_0_0..."
I1222 18:24:43.352109 2239 backend_model_instance.cc:807] "Stopping backend thread for postprocessing_0_1..."
I1222 18:24:43.352157 2239 model_lifecycle.cc:636] "successfully unloaded 'ensemble' version 1"
I1222 18:24:43.352168 2239 backend_model_instance.cc:807] "Stopping backend thread for preprocessing_0_2..."
I1222 18:24:43.352270 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352367 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352394 2239 python_be.cc:2387] "TRITONBACKEND_ModelInstanceFinalize: delete instance state"
I1222 18:24:43.352402 2239 backend_model_instance.cc:807] "Stopping backend thread for tensorrt_llm_0_0..."
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[TensorRT-LLM][INFO] Leader recvReq thread exiting
[TensorRT-LLM][INFO] Leader sendThread exiting
I1222 18:24:43.524265 2239 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm' version 1"
[TensorRT-LLM][INFO] Refreshed the MPI local session
Let me know if you need any more information. I would like to understand why, by default, the
dynamic_batching {
preferred_batch_size: [ 1024 ]
max_queue_delay_microseconds: 1000000
}
block is not set by the scripts in the example, and whether I need to set it for in-flight batching.