RuntimeError: Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.
#2636
Open · anapple-hub opened this issue on Dec 28, 2024 · 0 comments
Hardware: 1 DGX box with 8x A100
Torch version: 2.4.0a0+07cecf4168.nv24.5
CUDA version: 12.4
tensorrt-llm: 0.12.0
When using ModelRunnerCpp (executor) for inference with the prompt_table parameter, an error occurs at the call request_ids = self.session.enqueue_requests(requests). Full log and stack trace:
[TensorRT-LLM][INFO] Engine version 0.12.0.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0.2 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 2176
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 2176
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 17408
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 2175 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3523 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1364.26 MiB for execution context memory.
[TensorRT-LLM][INFO] [MS] Running engine with multi stream info
[TensorRT-LLM][INFO] [MS] Number of aux streams is 1
[TensorRT-LLM][INFO] [MS] Number of total worker streams is 2
[TensorRT-LLM][INFO] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3519 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 69.10 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 23.95 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.35 GiB, available: 69.29 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 2661
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 17
[TensorRT-LLM][INFO] Number of tokens per block: 128.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.37 GiB for max tokens in paged KV cache (340608).
0%| | 0/35511 [00:00<?, ?it/s][TensorRT-LLM][ERROR] Encountered an error when fetching new request: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See pytorch/rfcs#17 for more details. (tensorrt-llm/cpp/tensorrt_llm/runtime/torchView.h:89)
1 0x7fd1673b6aa0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x2daa0) [0x7fd1673b6aa0]
2 0x7fd0f4e2d536 tensorrt_llm::runtime::ITensor::unsqueeze(int) + 54
3 0x7fd0f4e339de tensorrt_llm::batch_manager::GenericLlmRequest<std::shared_ptr<tensorrt_llm::runtime::ITensor>, std::shared_ptr<tensorrt_llm::runtime::CudaStream> >::GenericLlmRequest(unsigned long, tensorrt_llm::executor::Request const&) + 2190
4 0x7fd0f4e2b411 tensorrt_llm::executor::Executor::Impl::fetchNewRequests[abi:cxx11](int, std::optional) + 321
5 0x7fd0f4e2cda3 tensorrt_llm::executor::Executor::Impl::executionLoop() + 627
6 0x7fd350501253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fd350501253]
7 0x7fd352380ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fd352380ac3]
8 0x7fd352412850 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7fd352412850]
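For context, the failure comes from PyTorch's inference-tensor rules (pytorch/rfcs#17, referenced in the error): a tensor created under torch.inference_mode() cannot be updated in place once you are outside inference mode, and the trace above suggests tensorrt_llm::runtime::ITensor::unsqueeze is tripping exactly that check on the wrapped prompt-table tensor. A minimal illustration of the PyTorch restriction itself (plain PyTorch, no TensorRT-LLM involved):

```python
import torch

with torch.inference_mode():
    t = torch.zeros(4)          # t is an "inference tensor"

try:
    # Any in-place update outside inference mode raises the same
    # "Inplace update to inference tensor outside InferenceMode" error.
    t.add_(1)
except RuntimeError as e:
    print(e)

# Cloning outside inference mode yields a normal tensor that can be
# updated in place without complaint.
t_ok = t.clone()
t_ok.add_(1)
print(t_ok)                     # tensor([1., 1., 1., 1.])
```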
I tried disabling TORCH_DISTRIBUTED_DEBUG with export TORCH_DISTRIBUTED_DEBUG=OFF, but it did not help.
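A possible workaround, following the hint in the error message itself, is to clone the prompt table outside torch.inference_mode() before handing it to ModelRunnerCpp, so the executor receives a normal tensor it can reshape. This is only a sketch, not a confirmed fix; runner, batch_input_ids, end_id, pad_id, and the generate() keyword names below are assumptions modeled on the tensorrt_llm example scripts and may differ from your code.

```python
import torch

# Assumption: prompt_table was produced under torch.inference_mode()
# (e.g. by a vision/embedding model), which makes it an inference tensor.
# detach().clone() outside inference mode returns a normal tensor.
prompt_table = prompt_table.detach().clone()

outputs = runner.generate(          # runner: ModelRunnerCpp instance (assumed)
    batch_input_ids,
    prompt_table=prompt_table,      # keyword name per tensorrt_llm examples
    max_new_tokens=64,
    end_id=end_id,
    pad_id=pad_id,
)
```

An alternative along the same lines would be to build the prompt table under torch.no_grad() instead of torch.inference_mode(), which avoids creating an inference tensor in the first place.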