unable to build qwen awq model with multi gpus #776

Closed
tbup opened this issue Dec 29, 2023 · 5 comments
Assignees: juney-nvidia
Labels: Low Precision (issue about lower-bit quantization, including int8, int4, fp8), stale, triaged (issue has been triaged by maintainers)

Comments


tbup commented Dec 29, 2023

python quantize.py --model_dir /qwen-14b-chat --dtype float16 --qformat int4_awq --export_path ./qwen_14b_4bit_gs128_awq.pt --calib_size 32

python build.py --hf_model_dir=/qwen-14b-chat/ --quant_ckpt_path ./qwen_14b_4bit_gs128_awq.pt --output_dir ./tmp/ --dtype float16 --use_inflight_batching --paged_kv_cache --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --max_batch_size 16 --enable_context_fmha --use_weight_only --weight_only_precision int4_awq --per_group --world_size 2 --tp_size 2

[12/29/2023-11:05:28] [TRT-LLM] [I] Serially build TensorRT engines.
[12/29/2023-11:05:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 113, GPU 263 (MiB)
[12/29/2023-11:05:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2048, GPU 575 (MiB)
[12/29/2023-11:05:32] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/29/2023-11:05:32] [TRT-LLM] [I] Loading weights from groupwise AWQ Qwen safetensors...
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:00<?, ?it/s][12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in
build(0, args)
File "/app/tensorrt_llm/examples/qwen/build.py", line 612, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/qwen/build.py", line 457, in build_rank_engine
load_func(tensorrt_llm_qwen=tensorrt_llm_qwen,
File "/app/tensorrt_llm/examples/qwen/weight.py", line 897, in load_from_awq_qwen
process_and_assign_weight(model_params, mPrefix, mOp, 0)
File "/app/tensorrt_llm/examples/qwen/weight.py", line 830, in process_and_assign_weight
mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 112, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 6848), original: (5120, 3424)
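
For reference, the two shapes in the assertion are consistent with the tensor-parallel shard being taken along the hidden axis instead of the intermediate axis. A rough sanity check (assuming Qwen-14B's hidden size of 5120, intermediate size of 13696, tp_size=2, and two int4 weights packed per int8; the exact layout used by weight.py may differ, so this is only back-of-the-envelope arithmetic):

```python
# Hypothetical sizes for a sanity check; not TensorRT-LLM code.
hidden, inter = 5120, 13696   # assumed Qwen-14B hidden / intermediate sizes
tp, pack = 2, 2               # tp_size=2; two int4 weights packed into one int8

expected = (hidden, inter // pack // tp)   # (5120, 3424): shard the intermediate axis
observed = (hidden // tp, inter // pack)   # (2560, 6848): shard the hidden axis instead

print(expected, observed)
```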

@juney-nvidia (Collaborator)

@tbup
Thanks for reporting this. I will discuss it with the engineer adding the Qwen INT4 AWQ support to help investigate.

June

juney-nvidia self-assigned this Dec 30, 2023
juney-nvidia added the triaged and Low Precision labels Dec 30, 2023
@juney-nvidia (Collaborator)

@tbup
Our engineer has already started the investigation and is working on a fix.

Thanks
June


nanmi commented Jan 5, 2024


Maybe the MLP gate and up_proj have the same split dim=1 (ColumnLinear should split the output channel) while down_proj is split on dim=0 (RowLinear should split the input channel).
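
A minimal NumPy sketch of that sharding convention (assuming weights stored as [out_features, in_features]; the packed AWQ layout in weight.py may be transposed, so this is only illustrative):

```python
import numpy as np

tp_size, rank = 2, 0
hidden, inter = 5120, 13696          # illustrative Qwen-14B-like sizes

w_gate_up = np.zeros((inter, hidden), dtype=np.float16)   # ColumnLinear weight
w_down = np.zeros((hidden, inter), dtype=np.float16)      # RowLinear weight

# ColumnLinear (gate/up_proj): shard the output channels, keep inputs whole.
gate_up_shard = np.split(w_gate_up, tp_size, axis=0)[rank]   # (inter // 2, hidden)

# RowLinear (down_proj): shard the input channels, keep outputs whole.
down_shard = np.split(w_down, tp_size, axis=1)[rank]         # (hidden, inter // 2)

print(gate_up_shard.shape, down_shard.shape)
```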

@hello-11 (Collaborator)

@tbup Do you still have the problem? If not, we will close it soon.

hello-11 added the stale label Nov 18, 2024
@nv-guomingz (Collaborator)

Feel free to reopen it if needed.
