RuntimeError: Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.
#2636
Open · anapple-hub opened this issue on Dec 28, 2024 · 0 comments
Hardware: 1 DGX box with 8x A100
Torch version: 2.4.0a0+07cecf4168.nv24.5
CUDA version: 12.4
tensorrt-llm: 0.12.0
When using ModelRunnerCpp (executor) for inference with the prompt_table parameter, an error occurs at the call request_ids = self.session.enqueue_requests(requests). Full log and stack trace:
[TensorRT-LLM][INFO] Engine version 0.12.0.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2176
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2176
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 17408
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2175 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3523 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1364.26 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3519 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 69.10 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 23.95 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.35 GiB, available: 69.29 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 2661
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 17
[TensorRT-LLM][INFO] Number of tokens per block: 128.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.37 GiB for max tokens in paged KV cache (340608).
0%| | 0/35511 [00:00<?, ?it/s][TensorRT-LLM][ERROR] Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details. (tensorrt-llm/cpp/tensorrt_llm/runtime/torchView.h:89)
1 0x7fd1673b6aa0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2daa0) [0x7fd1673b6aa0]
2 0x7fd0f4e2d536 tensorrt_llm::runtime::ITensor::unsqueeze(int) + 54
3 0x7fd0f4e339de tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::GenericLlmRequest(unsigned long, tensorrt_llm::executor::Request const&) + 2190
4 0x7fd0f4e2b411 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 321
5 0x7fd0f4e2cda3 tensorrt_llm::executor::Executor::Impl::executionLoop() + 627
6 0x7fd350501253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd350501253]
7 0x7fd352380ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd352380ac3]
8 0x7fd352412850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fd352412850]
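For context, the failure comes from PyTorch's inference-tensor rules (pytorch/rfcs#17, referenced in the error): a tensor created under torch.inference_mode() cannot be updated in place once you are outside inference mode, and the trace above suggests tensorrt_llm::runtime::ITensor::unsqueeze is tripping exactly that check on the wrapped prompt-table tensor. A minimal illustration of the PyTorch restriction itself (plain PyTorch, no TensorRT-LLM involved):

```python
import torch

with torch.inference_mode():
    t = torch.zeros(4)          # t is an "inference tensor"

try:
    # Any in-place update outside inference mode raises the same
    # "Inplace update to inference tensor outside InferenceMode" error.
    t.add_(1)
except RuntimeError as e:
    print(e)

# Cloning outside inference mode yields a normal tensor that can be
# updated in place without complaint.
t_ok = t.clone()
t_ok.add_(1)
print(t_ok)                     # tensor([1., 1., 1., 1.])
```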
I tried disabling TORCH_DISTRIBUTED_DEBUG with export TORCH_DISTRIBUTED_DEBUG=OFF, but it did not help.
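A possible workaround, following the hint in the error message itself, is to clone the prompt table outside torch.inference_mode() before handing it to ModelRunnerCpp, so the executor receives a normal tensor it can reshape. This is only a sketch, not a confirmed fix; runner, batch_input_ids, end_id, pad_id, and the generate() keyword names below are assumptions modeled on the tensorrt_llm example scripts and may differ from your code.

```python
import torch

# Assumption: prompt_table was produced under torch.inference_mode()
# (e.g. by a vision/embedding model), which makes it an inference tensor.
# detach().clone() outside inference mode returns a normal tensor.
prompt_table = prompt_table.detach().clone()

outputs = runner.generate(          # runner: ModelRunnerCpp instance (assumed)
    batch_input_ids,
    prompt_table=prompt_table,      # keyword name per tensorrt_llm examples
    max_new_tokens=64,
    end_id=end_id,
    pad_id=pad_id,
)
```

An alternative along the same lines would be to build the prompt table under torch.no_grad() instead of torch.inference_mode(), which avoids creating an inference tensor in the first place.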