unable to build qwen awq model with multi gpus #776

Closed
tbup opened this issue Dec 29, 2023 · 5 comments
Assignees: juney-nvidia
Labels: Low Precision (issue about lower-bit quantization, including int8, int4, fp8), stale, triaged (issue has been triaged by maintainers)

Comments


tbup commented Dec 29, 2023

python quantize.py --model_dir /qwen-14b-chat --dtype float16 --qformat int4_awq --export_path ./qwen_14b_4bit_gs128_awq.pt --calib_size 32

python build.py --hf_model_dir=/qwen-14b-chat/ --quant_ckpt_path ./qwen_14b_4bit_gs128_awq.pt --output_dir ./tmp/ --dtype float16 --use_inflight_batching --paged_kv_cache --use_gpt_attention_plugin float16 --use_gemm_plugin float16 --remove_input_padding --max_batch_size 16 --enable_context_fmha --use_weight_only --weight_only_precision int4_awq --per_group --world_size 2 --tp_size 2

[12/29/2023-11:05:28] [TRT-LLM] [I] Serially build TensorRT engines.
[12/29/2023-11:05:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 113, GPU 263 (MiB)
[12/29/2023-11:05:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 2048, GPU 575 (MiB)
[12/29/2023-11:05:32] [TRT-LLM] [W] Invalid timing cache, using freshly created one
[12/29/2023-11:05:32] [TRT-LLM] [I] Loading weights from groupwise AWQ Qwen safetensors...
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:00<?, ?it/s][12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:45] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.INT8 but set to int8
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
[12/29/2023-11:05:46] [TRT-LLM] [W] Parameter was initialized as DataType.HALF but set to float16
Loading weights...: 0%| | 0/40 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in
build(0, args)
File "/app/tensorrt_llm/examples/qwen/build.py", line 612, in build
engine = build_rank_engine(builder, builder_config, engine_name,
File "/app/tensorrt_llm/examples/qwen/build.py", line 457, in build_rank_engine
load_func(tensorrt_llm_qwen=tensorrt_llm_qwen,
File "/app/tensorrt_llm/examples/qwen/weight.py", line 897, in load_from_awq_qwen
process_and_assign_weight(model_params, mPrefix, mOp, 0)
File "/app/tensorrt_llm/examples/qwen/weight.py", line 830, in process_and_assign_weight
mOp.qweight.value = AWQ_quantize_pack_preprocess(weight, scale)
File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/parameter.py", line 112, in value
assert v.shape == self._shape,
AssertionError: The value updated is not the same shape as the original. Updated: (2560, 6848), original: (5120, 3424)
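
For reference, the two shapes in the assertion are consistent with the tensor-parallel shard being taken along the hidden axis instead of the intermediate axis. A rough sanity check (assuming Qwen-14B's hidden size of 5120, intermediate size of 13696, tp_size=2, and two int4 weights packed per int8; the exact layout used by weight.py may differ, so this is only back-of-the-envelope arithmetic):

```python
# Hypothetical sizes for a sanity check; not TensorRT-LLM code.
hidden, inter = 5120, 13696   # assumed Qwen-14B hidden / intermediate sizes
tp, pack = 2, 2               # tp_size=2; two int4 weights packed into one int8

expected = (hidden, inter // pack // tp)   # (5120, 3424): shard the intermediate axis
observed = (hidden // tp, inter // pack)   # (2560, 6848): shard the hidden axis instead

print(expected, observed)
```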

@juney-nvidia (Collaborator)

@tbup
Thanks for reporting this. I will discuss it with the engineer adding the Qwen INT4 AWQ support to help investigate.

June

juney-nvidia self-assigned this Dec 30, 2023
juney-nvidia added the triaged and Low Precision labels Dec 30, 2023
@juney-nvidia (Collaborator)

@tbup
Our engineer has already started the investigation and is working on a fix.

Thanks
June


nanmi commented Jan 5, 2024


Maybe the MLP gate and up_proj have the same split dim=1 (ColumnLinear should split the output channel) while down_proj is split on dim=0 (RowLinear should split the input channel).
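
A minimal NumPy sketch of that sharding convention (assuming weights stored as [out_features, in_features]; the packed AWQ layout in weight.py may be transposed, so this is only illustrative):

```python
import numpy as np

tp_size, rank = 2, 0
hidden, inter = 5120, 13696          # illustrative Qwen-14B-like sizes

w_gate_up = np.zeros((inter, hidden), dtype=np.float16)   # ColumnLinear weight
w_down = np.zeros((hidden, inter), dtype=np.float16)      # RowLinear weight

# ColumnLinear (gate/up_proj): shard the output channels, keep inputs whole.
gate_up_shard = np.split(w_gate_up, tp_size, axis=0)[rank]   # (inter // 2, hidden)

# RowLinear (down_proj): shard the input channels, keep outputs whole.
down_shard = np.split(w_down, tp_size, axis=1)[rank]         # (hidden, inter // 2)

print(gate_up_shard.shape, down_shard.shape)
```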

@hello-11 (Collaborator)

@tbup Do you still have the problem? If not, we will close it soon.

hello-11 added the stale label Nov 18, 2024
@nv-guomingz (Collaborator)

Feel free to reopen it if needed.
