I tried the latest official image and followed the official tutorial, but I hit the same bug. The Triton Server tutorial I followed is https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/multimodal.md.
My GPU is an A10.
root@9d0fd755a252:/ws# I0224 03:28:06.809225 3081 model_lifecycle.cc:849] "successfully loaded 'postprocessing'" I0224 03:28:06.809270 3081 model_lifecycle.cc:849] "successfully loaded 'preprocessing'" I0224 03:28:06.821670 3081 python_be.cc:2249] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)" [TensorRT-LLM][INFO] Initialized MPI [TensorRT-LLM][INFO] Refreshed the MPI local session [TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0 [TensorRT-LLM][INFO] Rank 0 is using GPU 0 [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2 [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2 [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1 [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2560 [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0 [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (2560) * 32 [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0 [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1 [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120 [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2559 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens). [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None I0224 03:28:08.935412 3081 model_lifecycle.cc:849] "successfully loaded 'tensorrt_llm_bls'" [TensorRT-LLM][INFO] Loaded engine size: 12860 MiB [TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues... [TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 402.52 MiB for execution context memory. [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12855 (MiB) [TensorRT-LLM][INFO] [MemUsageChange] Allocated 4.88 MB GPU memory for runtime buffers. [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.18 MB GPU memory for decoder. 
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 22.19 GiB, available: 8.70 GiB [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 251 [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true E0224 03:28:22.059961 3081 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 
/opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]" E0224 03:28:22.060118 3081 model_lifecycle.cc:654] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 
triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c]" I0224 03:28:22.060173 3081 model_lifecycle.cc:789] "failed to load 'tensorrt_llm'" [TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1 [02/24/2025-03:28:28] [TRT] [I] Loaded engine size: 599 MiB [02/24/2025-03:28:28] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +29, now: CPU 0, GPU 624 (MiB) I0224 03:28:28.368156 3081 model_lifecycle.cc:849] "successfully loaded 'multimodal_encoders'" E0224 03:28:28.368315 3081 model_repository_manager.cc:703] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. 
Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set crossKvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281)\n1 0x7fdb136bdff8 tensorrt_llm::common::throwRuntimeError(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 95\n2 0x7fdb136fd90b /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x77e90b) [0x7fdb136fd90b]\n3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489\n4 0x7fdb14597369 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185\n5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229\n6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474\n7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87\n8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e]\n9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185\n10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66\n11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153\n12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649]\n13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2]\n14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3]\n15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4]\n16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d]\n17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3]\n18 0x7fdbfd1b4f02 
/opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02]\n19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc]\n20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12]\n21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1]\n22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c]\n23 0x7fdbfd427305 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3f1305) [0x7fdbfd427305]\n24 0x7fdbfc991db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fdbfc991db4]\n25 0x7fdbfc646a94 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9ca94) [0x7fdbfc646a94]\n26 0x7fdbfc6d3c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fdbfc6d3c3c];" I0224 03:28:28.368439 3081 server.cc:604] +------------------+------+ | Repository Agent | Path | +------------------+------+ +------------------+------+ I0224 03:28:28.368472 3081 server.cc:631] +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ | Backend | Path | Config | +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ | python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ | | | | backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0_", | | | | "default-max-batch-size":"4"}} | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/ | | | | backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} | +-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------+ I0224 03:28:28.368547 3081 server.cc:674] +---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ | Model | Version | Status | +---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ | multimodal_encoders | 1 | READY | | postprocessing | 1 | READY | | preprocessing | 1 | READY | | tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Do not set cross | | | | KvCacheFraction for decoder-only model (/workspace/tensorrt_llm/cpp/tensorrt_llm/batch_manager/trtGptModelInflightBatching.cpp:281 | | | | ) | | | | 3 0x7fdb14476df9 tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_l | | | | lm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt | | | | _llm::batch_manager::TrtGptModelOptionalParams const&) + 489 | | | | 6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::file | | | | system::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 | | | | 3 0x7fdb14476df9 
tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 489 | | | | 9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_bat | | | | cher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 | | | | 5 0x7fdb145979fd tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1229 | | | | 6 0x7fdb14598c4a tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474 | | | | 7 0x7fdb1457e6d7 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87 | | | | 8 0x7fdbf02fe88e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x3388e) [0x7fdbf02fe88e] | | | | 9 0x7fdbf02fb049 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*) + 2185 | | | | 10 0x7fdbf02fb592 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 66 | | | | 11 0x7fdbf02e8929 TRITONBACKEND_ModelInstanceInitialize + 153 | | | | 12 0x7fdbfd1d7649 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a1649) [0x7fdbfd1d7649] | | | | 13 0x7fdbfd1d80d2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a20d2) [0x7fdbfd1d80d2] | | | | 14 0x7fdbfd1bdcf3 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187cf3) [0x7fdbfd1bdcf3] | | | | 15 0x7fdbfd1be0a4 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1880a4) [0x7fdbfd1be0a4] | | | | 16 0x7fdbfd1c768d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19168d) [0x7fdbfd1c768d] | | | | 17 0x7fdbfc64bec3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ec3) [0x7fdbfc64bec3] | | | | 18 0x7fdbfd1b4f02 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17ef02) [0x7fdbfd1b4f02] | | | | 19 0x7fdbfd1c2ddc /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18cddc) [0x7fdbfd1c2ddc] | | | | 20 0x7fdbfd1c6e12 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x190e12) [0x7fdbfd1c6e12] | | | | 21 0x7fdbfd2c78e1 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2918e1) [0x7fdbfd2c78e1] | | | | 22 0x7fdbfd2cac3c /opt/tritonserver/bin/../lib/libtritonserver.so(+0x294c3c) [0x7fdbfd2cac3c] | | tensorrt_llm_bls | 1 | READY | 
+---------------------+---------+------------------------------------------------------------------------------------------------------------------------------------+ I0224 03:28:28.720425 3081 metrics.cc:890] "Collecting metrics for GPU 0: NVIDIA A10" I0224 03:28:28.759813 3081 metrics.cc:783] "Collecting CPU metrics" I0224 03:28:28.760016 3081 tritonserver.cc:2598] +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ | Option | Value | +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ | server_id | triton | | server_version | 2.54.0 | | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared | | | _memory cuda_shared_memory binary_tensor_data parameters statistics trace logging | | model_repository_path[0] | multimodal_ifb/ | | model_control_mode | MODE_NONE | | strict_model_config | 1 | | model_config_name | | | rate_limit | OFF | | pinned_memory_pool_byte_size | 268435456 | | cuda_memory_pool_byte_size{0} | 300000000 | | min_supported_compute_capability | 6.0 | | strict_readiness | 1 | | exit_timeout | 30 | | cache_enabled | 0 | +----------------------------------+--------------------------------------------------------------------------------------------------------------------------------+ I0224 03:28:28.760047 3081 server.cc:305] "Waiting for in-flight requests to complete." I0224 03:28:28.760062 3081 server.cc:321] "Timeout 30: Found 0 model versions that have in-flight inferences" I0224 03:28:28.760969 3081 server.cc:336] "All models are stopped, unloading models" I0224 03:28:28.760980 3081 server.cc:345] "Timeout 30: Found 4 live models and 0 in-flight non-inference requests" I0224 03:28:29.761084 3081 server.cc:345] "Timeout 29: Found 4 live models and 0 in-flight non-inference requests" [02/24/2025-03:28:29] [TRT-LLM] [I] Cleaning up... Cleaning up... Cleaning up... Cleaning up... I0224 03:28:30.152836 3081 model_lifecycle.cc:636] "successfully unloaded 'tensorrt_llm_bls' version 1" I0224 03:28:30.761212 3081 server.cc:345] "Timeout 28: Found 3 live models and 0 in-flight non-inference requests" I0224 03:28:31.033774 3081 model_lifecycle.cc:636] "successfully unloaded 'preprocessing' version 1" I0224 03:28:31.091577 3081 model_lifecycle.cc:636] "successfully unloaded 'postprocessing' version 1" I0224 03:28:31.406108 3081 model_lifecycle.cc:636] "successfully unloaded 'multimodal_encoders' version 1" I0224 03:28:31.761284 3081 server.cc:345] "Timeout 27: Found 0 live models and 0 in-flight non-inference requests" error: creating server: Internal - failed to load all models
My config.pbtxt sets the cross_kv_cache_fraction key:

parameters: {
  key: "batch_scheduler_policy"
  value: { string_value: "${batch_scheduler_policy}" }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: { string_value: "${kv_cache_free_gpu_mem_fraction}" }
}
parameters: {
  key: "cross_kv_cache_fraction"
  value: { string_value: "0.5" }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: { string_value: "${kv_cache_host_memory_bytes}" }
}
Could you remove this line?

`parameters: { key: "cross_kv_cache_fraction" value: { string_value: "0.5" } }`

or set it to an empty string:

`parameters: { key: "cross_kv_cache_fraction" value: { string_value: "" } }`