Error during TensorRT-LLM build - Invalid shape and type mismatch in elementwise addition #2622

cocovoc commented Dec 24, 2024

Description:
I'm encountering an error when building a model with TensorRT-LLM version 0.13.0. The failure happens during an elementwise addition inside the embedding layer of EncoderModel: the two inputs to the addition have mismatched data types (BFloat16 and Float, even though the checkpoint was converted with float16), so the resulting tensor has an invalid shape and the build aborts.

Error Message:

[12/24/2024-20:25:23] [TRT-LLM] [W] Found pynvml==11.4.0 and cuda driver version b'550.54.15'. Please use pynvml>=11.5.0 and cuda driver>=526 to get accurate memory usage.
[TensorRT-LLM] TensorRT-LLM version: 0.13.0
[12/24/2024-20:25:23] [TRT-LLM] [W] Option --paged_kv_cache is deprecated, use --kv_cache_type=paged/disabled instead.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set nccl_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set lookup_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set lora_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set moe_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set context_fmha to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set remove_input_padding to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set reduce_fusion to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set enable_xqa to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set tokens_per_block to 64.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set multiple_profiles to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_state to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set streamingllm to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set use_fused_mlp to True.
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_kv_cache to False.
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_position_embedding = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_type = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_attention_qkvo_bias = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_mlp_bias = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_model_final_layernorm = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_layernorm = False
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.has_embedding_scale = True
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.q_scaling = 1.0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layernorm_position = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.mlp_type = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.relative_attention = False
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.max_distance = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.num_buckets = 0
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.model_type = nmt
[12/24/2024-20:25:23] [TRT-LLM] [W] Implicitly setting PretrainedConfig.gated_act = False
[12/24/2024-20:25:23] [TRT-LLM] [I] Compute capability: (7, 5)
[12/24/2024-20:25:23] [TRT-LLM] [I] SM count: 40
[12/24/2024-20:25:23] [TRT-LLM] [I] SM clock: 1590 MHz
[12/24/2024-20:25:23] [TRT-LLM] [I] int4 TFLOPS: 260
[12/24/2024-20:25:23] [TRT-LLM] [I] int8 TFLOPS: 130
[12/24/2024-20:25:23] [TRT-LLM] [I] fp8 TFLOPS: 0
[12/24/2024-20:25:23] [TRT-LLM] [I] float16 TFLOPS: 65
[12/24/2024-20:25:23] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[12/24/2024-20:25:23] [TRT-LLM] [I] float32 TFLOPS: 8
[12/24/2024-20:25:23] [TRT-LLM] [I] Total Memory: 15 GiB
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory clock: 5001 MHz
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory bus width: 256
[12/24/2024-20:25:23] [TRT-LLM] [I] Memory bandwidth: 320 GB/s
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe speed: 8000 Mbps
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe link width: 16
[12/24/2024-20:25:23] [TRT-LLM] [I] PCIe bandwidth: 16 GB/s
[12/24/2024-20:25:23] [TRT-LLM] [W] Parameter was initialized as DataType.BF16 but set to DataType.FLOAT
[12/24/2024-20:25:23] [TRT-LLM] [I] Set dtype to bfloat16.
[12/24/2024-20:25:23] [TRT-LLM] [W] Overriding paged_state to False
[12/24/2024-20:25:23] [TRT-LLM] [I] Set paged_state to False.
[12/24/2024-20:25:23] [TRT-LLM] [I] max_seq_len is not specified for EncoderModel, using --max_input_len.
[12/24/2024-20:25:23] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[12/24/2024-20:25:23] [TRT-LLM] [W] max_num_tokens (800) shouldn't be greater than max_seq_len * max_batch_size (800), specifying to max_seq_len * max_batch_size (800).
[12/24/2024-20:25:23] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[12/24/2024-20:25:24] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 151, GPU 8302 (MiB)
[12/24/2024-20:25:25] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +955, GPU +190, now: CPU 1261, GPU 8492 (MiB)
[12/24/2024-20:25:25] [TRT-LLM] [I] Set nccl_plugin to None.
[12/24/2024-20:25:25] [TRT] [W] IElementWiseLayer with inputs EncoderModel/embedding/__mul___L345/elementwise_binary_L2855/ELEMENTWISE_PROD_0_output_0 and EncoderModel/embedding/position_embedding/embedding_L2693/GATHER_0_output_0: first input has type BFloat16 but second input has type Float.
[12/24/2024-20:25:25] [TRT] [E] ITensor::getDimensions: Error Code 4: API Usage Error (EncoderModel/embedding/__add___L321/elementwise_binary_L2855/ELEMENTWISE_SUM_0: ElementWiseOperation SUM must have same input types. But they are of types BFloat16 and Float.)
Traceback (most recent call last):
  File "/opt/miniconda/envs/python37/envs/py10/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 575, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 429, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 396, in build_and_save
    engine = build_model(build_config,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 389, in build_model
    return build(model, build_config)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1152, in build
    model(**inputs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 640, in forward
    hidden_states = self.embedding(input_ids, position_ids,
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/models/enc_dec/model.py", line 143, in forward
    x = x + pos_emb
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 321, in __add__
    return add(self, b)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 2855, in elementwise_binary
    return _create_tensor(layer.get_output(0), layer)
  File "/opt/miniconda/envs/python37/envs/py10/lib/python3.10/site-packages/tensorrt_llm/functional.py", line 607, in _create_tensor
    assert trt_tensor.shape.__len__(
AssertionError: tensor EncoderModel/embedding/__add___L321/elementwise_binary_L2855/ELEMENTWISE_SUM_0_output_0 has an invalid shape
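The failing line is x = x + pos_emb in tensorrt_llm/models/enc_dec/model.py (see the traceback above). The build log also warns "Parameter was initialized as DataType.BF16 but set to DataType.FLOAT", so the position-embedding weight apparently ends up as float32 while the word embeddings are bfloat16. A possible local workaround, untested and only a sketch of the idea, is to cast the position embedding to the word-embedding dtype before the addition:

# Sketch of a possible patch around line 143 of tensorrt_llm/models/enc_dec/model.py
# (EncoderModel embedding forward). Here "x" holds the word embeddings (BFloat16 in
# my build) and "pos_emb" the gathered position embeddings (Float). The cast is my
# assumption about a workaround, not a confirmed fix.
from tensorrt_llm.functional import cast

if pos_emb.dtype != x.dtype:
    pos_emb = cast(pos_emb, x.dtype)  # align dtypes so ELEMENTWISE_SUM is valid
x = x + pos_emb

Even if this unblocks the build, the root cause is probably that the converted checkpoint stores the position-embedding weight in the wrong dtype, so a fix on the convert_checkpoint.py side would be cleaner.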

Convert script:

export MODEL_NAME="FairSeq_1223" # or "flan-t5-small"
export MODEL_TYPE="nmt"
export INFERENCE_PRECISION="float16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=2
model_dir="modelDir/model"
python convert_checkpoint.py --model_type ${MODEL_TYPE} \
                --model_dir ${model_dir} \
                --output_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION} \
                --tp_size ${TP_SIZE} \
                --pp_size ${PP_SIZE} \
                --dtype ${INFERENCE_PRECISION}
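
Since the convert script requests float16 but the build log reports "Set dtype to bfloat16", it may be worth checking which dtype actually landed in the converted checkpoint. A quick check (my own addition; the path just follows the script above):

# Inspect the dtype recorded in the converted encoder checkpoint config.
import json

cfg_path = "tmp/trt_models/FairSeq_1223/float16/encoder/config.json"  # path from the convert script
with open(cfg_path) as f:
    cfg = json.load(f)
print(cfg.get("dtype"))  # expected "float16"; the build log suggests it may be "bfloat16"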

Build script:

export MODEL_NAME="FairSeq_1223" # or "flan-t5-small"
export MODEL_TYPE="nmt"
export INFERENCE_PRECISION="float16"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1
export MAX_BEAM_WIDTH=1


# Note: non-T5 models can enable FMHA for the encoder part; for FP16/BF16 it is enabled by default
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
                --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/encoder \
                --paged_kv_cache disable \
                --moe_plugin disable \
                --enable_xqa disable \
                --max_beam_width ${MAX_BEAM_WIDTH} \
                --max_batch_size 8 \
                --max_input_len 100 \
                --bert_attention_plugin ${INFERENCE_PRECISION} \
                --gpt_attention_plugin ${INFERENCE_PRECISION} \
                --remove_input_padding disable
                # --context_fmha disable should be removed
echo "encoder done!"
trtllm-build --checkpoint_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION}/decoder \
                --output_dir tmp/trt_engines/${MODEL_NAME}/${INFERENCE_PRECISION}/decoder \
                --moe_plugin disable \
                --enable_xqa disable \
                --max_beam_width ${MAX_BEAM_WIDTH} \
                --max_batch_size 8 \
                --max_input_len 1 \
                --max_seq_len 201  \
                --max_encoder_input_len 100 \
                --bert_attention_plugin ${INFERENCE_PRECISION} \
                --gpt_attention_plugin ${INFERENCE_PRECISION} \
                --remove_input_padding disable
                # --context_fmha disable should be removed