I tried the latest official image and followed the official tutorial, but I hit the same bug. The Triton Server tutorial I followed is https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/multimodal.md.
My GPU is an A10.
root@9d0fd755a252:/ws# I0224 03:28:06.809225 3081 model_lifecycle.cc:849] "successfully loaded 'postprocessing'" I0224 03:28:06.809270 3081 model_lifecycle.cc:849] "successfully loaded 'preprocessing'" I0224 03:28:06.821670 3081 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)" [TensorRT-LLM][INFO] Initialized MPI [TensorRT-LLM][INFO] Refreshed the MPI local session [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0 [TensorRT-LLM][INFO] Rank 0 is using GPU 0 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2 [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1 [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2560 [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 32 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120 [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2559 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens). [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None I0224 03:28:08.935412 3081 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'" [TensorRT-LLM][INFO] Loaded engine size: 12860 MiB [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues... [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 402.52 MiB for execution context memory. [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12855 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.88 MB GPU memory for runtime buffers. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.18 MB GPU memory for decoder. 
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 8.70 GiB [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 251 [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true E0224 03:28:22.059961 3081 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 
/opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]" E0224 03:28:22.060118 3081 model_lifecycle.cc:654] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 
triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]" I0224 03:28:22.060173 3081 model_lifecycle.cc:789] "failed to load 'tensorrt_llm'" [TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1 [02/24/2025-03:28:28] [TRT] [I] Loaded engine size: 599 MiB [02/24/2025-03:28:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +29, now: CPU 0, GPU 624 (MiB) I0224 03:28:28.368156 3081 model_lifecycle.cc:849] "successfully loaded 'multimodal_encoders'" E0224 03:28:28.368315 3081 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. 
Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 
/opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c];" I0224 03:28:28.368439 3081 server.cc:604] +------------------+------+ | Repository Agent | Path | +------------------+------+ +------------------+------+ I0224 03:28:28.368472 3081 server.cc:631] +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ | Backend | Path | Config | +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ | | | | backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_", | | | | "default-max-batch-size":"4"}} | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ | | | | backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} | +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ I0224 03:28:28.368547 3081 server.cc:674] +---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ | Model | Version | Status | +---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ | multimodal_encoders | 1 | READY | | postprocessing | 1 | READY | | preprocessing | 1 | READY | | tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set cross | | | | KvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281 | | | | ) | | | | 3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_l | | | | lm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt | | | | _llm::batch_manager::TrtGptModelOptionalParams const&) + 489 | | | | 6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::file | | | | system::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 | | | | 3 0x7fdb14476df9 
tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489 | | | | 9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_bat | | | | cher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 | | | | 5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229 | | | | 6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 | | | | 7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87 | | | | 8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e] | | | | 9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 | | | | 10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 | | | | 11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153 | | | | 12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649] | | | | 13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2] | | | | 14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3] | | | | 15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4] | | | | 16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d] | | | | 17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3] | | | | 18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02] | | | | 19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc] | | | | 20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12] | | | | 21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1] | | | | 22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c] | | tensorrt_llm_bls | 1 | READY | 
+---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ I0224 03:28:28.720425 3081 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A10" I0224 03:28:28.759813 3081 metrics.cc:783] "Collecting CPU metrics" I0224 03:28:28.760016 3081 tritonserver.cc:2598] +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ | Option | Value | +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ | server_id | triton | | server_version | 2.54.0 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared | | | _memory cuda_shared_memory binary_tensor_data parameters statistics trace logging | | model_repository_path[0] | multimodal_ifb/ | | model_control_mode | MODE_NONE | | strict_model_config | 1 | | model_config_name | | | rate_limit | OFF | | pinned_memory_pool_byte_size | 268435456 | | cuda_memory_pool_byte_size{0} | 300000000 | | min_supported_compute_capability | 6.0 | | strict_readiness | 1 | | exit_timeout | 30 | | cache_enabled | 0 | +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ I0224 03:28:28.760047 3081 server.cc:305] "Waiting for in-flight requests to complete." I0224 03:28:28.760062 3081 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences" I0224 03:28:28.760969 3081 server.cc:336] "All models are stopped, unloading models" I0224 03:28:28.760980 3081 server.cc:345] "Timeout 30: Found 4 live models and 0 in-flight non-inference requests" I0224 03:28:29.761084 3081 server.cc:345] "Timeout 29: Found 4 live models and 0 in-flight non-inference requests" [02/24/2025-03:28:29] [TRT-LLM] [I] Cleaning up... Cleaning up... Cleaning up... Cleaning up... I0224 03:28:30.152836 3081 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm_bls' version 1" I0224 03:28:30.761212 3081 server.cc:345] "Timeout 28: Found 3 live models and 0 in-flight non-inference requests" I0224 03:28:31.033774 3081 model_lifecycle.cc:636] "successfully unloaded 'preprocessing' version 1" I0224 03:28:31.091577 3081 model_lifecycle.cc:636] "successfully unloaded 'postprocessing' version 1" I0224 03:28:31.406108 3081 model_lifecycle.cc:636] "successfully unloaded 'multimodal_encoders' version 1" I0224 03:28:31.761284 3081 server.cc:345] "Timeout 27: Found 0 live models and 0 in-flight non-inference requests" error: creating server: Internal - failed to load all models
My config.pbtxt sets the cross_kv_cache_fraction key:

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "${batch_scheduler_policy}" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "${kv_cache_free_gpu_mem_fraction}" }
}
parameters: {
  key: "cross_kv_cache_fraction"
  value: { string_value: "0.5" }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: { string_value: "${kv_cache_host_memory_bytes}" }
}
Could you remove this line?

`parameters: { key: "cross_kv_cache_fraction" value: { string_value: "0.5" } }`

or set it to an empty string:

`parameters: { key: "cross_kv_cache_fraction" value: { string_value: "" } }`