Description:
I'm hitting a build failure with TensorRT-LLM version 0.13.0. The error occurs during an elementwise addition inside the embedding layer of EncoderModel: the two inputs to the add have mismatched data types (BFloat16 and Float), so TensorRT rejects the SUM operation and the build aborts.
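For context, TensorRT's IElementWiseLayer requires both inputs of a SUM to have the same type, so the bfloat16 token embedding cannot be added to a position embedding that stayed float32. Below is a minimal sketch of the kind of dtype alignment that would avoid the error, assuming the add at tensorrt_llm/models/enc_dec/model.py line 143 (see the traceback below) is the offending site; `cast` is the existing tensorrt_llm.functional op, while the helper name is mine:

```python
# Sketch only: align dtypes before the elementwise add that fails
# (x = x + pos_emb in the EncoderModel embedding forward).
from tensorrt_llm.functional import Tensor, cast

def add_position_embedding(x: Tensor, pos_emb: Tensor) -> Tensor:
    """Add pos_emb to x, casting pos_emb when the dtypes disagree
    (here: x is BFloat16 while pos_emb is still Float)."""
    if pos_emb.dtype != x.dtype:
        pos_emb = cast(pos_emb, x.dtype)
    return x + pos_emb
```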
Error Message:
[12/24/2024-20:25:23] [TRT-LLM] [W] Found pynvml==11.4.0 and cuda driver version b'550.54.15'. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
[12/24/2024-20:25:23] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set nccl_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set lookup_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set lora_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set moe_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set context_fmha to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set remove_input_padding to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set reduce_fusion to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set enable_xqa to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set tokens_per_block to 64.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set multiple_profiles to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_state to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set streamingllm to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_fused_mlp to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_kv_cache to False.
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_type = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_attention_qkvo_bias = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_mlp_bias = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_model_final_layernorm = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_layernorm = False
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_scale = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.q_scaling = 1.0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_position = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.mlp_type = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.relative_attention = False
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.max_distance = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_buckets = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.model_type = nmt
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.gated_act = False
[12/24/2024-20:25:23] [TRT-LLM] [I] Compute capability: (7, 5)
[12/24/2024-20:25:23] [TRT-LLM] [I] SM count: 40
[12/24/2024-20:25:23] [TRT-LLM] [I] SM clock: 1590 MHz
[12/24/2024-20:25:23] [TRT-LLM] [I] int4 TFLOPS: 260
[12/24/2024-20:25:23] [TRT-LLM] [I] int8 TFLOPS: 130
[12/24/2024-20:25:23] [TRT-LLM] [I] fp8 TFLOPS: 0
[12/24/2024-20:25:23] [TRT-LLM] [I] float16 TFLOPS: 65
[12/24/2024-20:25:23] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[12/24/2024-20:25:23] [TRT-LLM] [I] float32 TFLOPS: 8
[12/24/2024-20:25:23] [TRT-LLM] [I] Total Memory: 15 GiB
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory clock: 5001 MHz
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory bus width: 256
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe link width: 16
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe bandwidth: 16 GB/s
[12/24/2024-20:25:23] [TRT-LLM] [W] Parameter was initialized as DataType.BF16 but set to DataType.FLOAT
[12/24/2024-20:25:23] [TRT-LLM] [I] Set dtype to bfloat16.
[12/24/2024-20:25:23] [TRT-LLM] [W] Overriding paged_state to False
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_state to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] max_seq_len is not specified for EncoderModel, using --max_input_len.
[12/24/2024-20:25:23] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[12/24/2024-20:25:23] [TRT-LLM] [W] max_num_tokens (800) shouldn't be greater than max_seq_len * max_batch_size (800), specifying to max_seq_len * max_batch_size (800).
[12/24/2024-20:25:23] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[12/24/2024-20:25:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 151, GPU 8302 (MiB)
[12/24/2024-20:25:25] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +955, GPU +190, now: CPU 1261, GPU 8492 (MiB)
[12/24/2024-20:25:25] [TRT-LLM] [I] Set nccl_plugin to None.
[12/24/2024-20:25:25] [TRT] [W] IElementWiseLayer with inputs EncoderModel/embedding/__mul___L345/elementwise_binary_L2855/ELEMENTWISE_PROD_0_output_0 and EncoderModel/embedding/position_embedding/embedding_L2693/GATHER_0_output_0: first input has type BFloat16 but second input has type Float.
[12/24/2024-20:25:25] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (EncoderModel/embedding/__add___L321/elementwise_binary_L2855/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types BFloat16 and Float.)
Traceback (most recent call last):
  File "/opt/miniconda/envs/python37/envs/py10/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 575, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 429, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 396, in build_and_save
    engine = build_model(build_config,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 389, in build_model
    return build(model, build_config)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1152, in build
    model(**inputs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 640, in forward
    hidden_states = self.embedding(input_ids, position_ids,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 143, in forward
    x = x + pos_emb
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 321, in __add__
    return add(self, b)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 2855, in elementwise_binary
    return _create_tensor(layer.get_output(0), layer)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor EncoderModel/embedding/__add___L321/elementwise_binary_L2855/ELEMENTWISE_SUM_0_output_0 has an invalid shape
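Given the earlier warning "Parameter was initialized as DataType.BF16 but set to DataType.FLOAT", the mismatch may originate in the converted checkpoint, with some embedding weight still stored in float32. A checkpoint-side workaround could therefore be worth trying: cast every floating-point weight to bfloat16 before running trtllm-build. This is a hedged sketch that assumes the convert script produced a single-rank safetensors checkpoint; the path is a placeholder:

```python
# Sketch: force all floating-point weights in the converted TRT-LLM
# checkpoint to bfloat16 so no embedding table is left in float32.
# "ckpt/rank0.safetensors" is a placeholder path; adjust to your layout.
import torch
from safetensors.torch import load_file, save_file

weights = load_file("ckpt/rank0.safetensors")
weights = {
    name: w.to(torch.bfloat16) if w.is_floating_point() else w
    for name, w in weights.items()
}
save_file(weights, "ckpt/rank0.safetensors")
```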
Convert script:
trtllm-build command: