Unable to build Qwen AWQ model with multiple GPUs #776
Labels
- Low Precision: issue about lower-bit quantization, including int8, int4, fp8
- stale
- triaged: issue has been triaged by maintainers
python quantize.py --model_dir /qwen-14b-chat --dtype float16 --qformat int4_awq --export_path ./qwen_14b_4bit_gs128_awq.pt --calib_size 32
python build.py --hf_model_dir=/qwen-14b-chat/ --quant_ckpt_path ./qwen_14b_4bit_gs128_awq.pt --output_dir ./tmp/ --dtype float16 --use_inflight_batching --paged_kv_cache --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --max_batch_size 16 --enable_context_fmha --use_weight_only --weight_only_precision int4_awq --per_group --world_size 2 --tp_size 2
[12/29/2023-11:05:28] [TRT-LLM] [I] Serially build TensorRT engines.
[12/29/2023-11:05:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 113, GPU 263 (MiB)
[12/29/2023-11:05:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2048, GPU 575 (MiB)
[12/29/2023-11:05:32] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/29/2023-11:05:32] [TRT-LLM] [I] Loading weights from groupwise AWQ Qwen safetensors...
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:00<?, ?it/s]
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
(the two warnings above repeat for each parameter; duplicate lines omitted)
Loading weights...: 0%| | 0/40 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in <module>
    build(0, args)
  File "/app/tensorrt_llm/examples/qwen/build.py", line 612, in build
    engine = build_rank_engine(builder, builder_config, engine_name,
  File "/app/tensorrt_llm/examples/qwen/build.py", line 457, in build_rank_engine
    load_func(tensorrt_llm_qwen=tensorrt_llm_qwen,
  File "/app/tensorrt_llm/examples/qwen/weight.py", line 897, in load_from_awq_qwen
    process_and_assign_weight(model_params, mPrefix, mOp, 0)
  File "/app/tensorrt_llm/examples/qwen/weight.py", line 830, in process_and_assign_weight
    mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 112, in value
    assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 6848), original: (5120, 3424)
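For reference, the two shapes in the assertion differ by exactly the int4 packing factor (two int4 values per int8 element) combined with the tp_size=2 split. The sketch below is only an illustration of that arithmetic; the assumed full weight shape (5120, 6848) and the packing/split axes are inferred from the error message, not confirmed from the TensorRT-LLM source:

```python
# Hypothetical shape arithmetic behind the AssertionError above.
# Assumptions: the full (unpacked) weight is (5120, 6848); int4 AWQ
# packs two values per int8 element along the last dimension; tensor
# parallelism with tp_size=2 splits along the first dimension.

full_rows, full_cols = 5120, 6848
tp_size = 2
pack_factor = 2  # two int4 values per int8 element

# The parameter was declared with full rows and packed columns:
declared = (full_rows, full_cols // pack_factor)       # (5120, 3424)

# The loader handed over a per-rank row split with unpacked columns:
loaded = (full_rows // tp_size, full_cols)             # (2560, 6848)

print(declared, loaded)         # total element counts differ by pack_factor
print(declared != loaded)       # shapes disagree, so the assignment asserts
```

If this reading is right, the loader is splitting/packing the qweight tensor on different axes than the Parameter expects when world_size > 1, which is consistent with the build succeeding on a single GPU.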