[Feature Request] Better support for w4a8 quantization #2605

Open
ShuaiShao93 opened this issue Dec 20, 2024 · 4 comments
Labels: Investigating, Low Precision (Issue about lower bit quantization, including int8, int4, fp8), triaged (Issue has been triaged by maintainers)

Comments
@ShuaiShao93

Based on this doc, we have to use DeepCompressor to prepare a fake-quantized checkpoint. However, setting up that repo is a lot of extra trouble, and the tool does not seem to be well maintained, especially for newer Llama models like 3.1/3.2. At least I was not able to get it working for Llama 3.1 8B.

It would be great if we could add more native support for w4a8 quantization in TensorRT-LLM.

@nv-guomingz nv-guomingz added the Low Precision Issue about lower bit quantization, including int8, int4, fp8 label Dec 23, 2024
@nv-guomingz
Collaborator

@Barry-Delaney would you please add comments here?

@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Dec 23, 2024
@bobboli
Collaborator

bobboli commented Dec 23, 2024

Hi,

Our officially supported toolkit for quantization is ModelOpt. We have discussed this before and found that it is not trivial to land the techniques used by DeepCompressor (such as asymmetric quantization, double quantization, rotation, smoothing, etc.) into ModelOpt. At least in the near future, we need to rely on DeepCompressor.

If you run into problems quantizing new models, could you try implementing them yourself, since DeepCompressor has an abstraction layer for models like this? You could also raise an issue in the DeepCompressor repository and paste the errors in detail; the authors of DeepCompressor would be glad to answer.

Thank you!
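For readers unfamiliar with the terms above, here is a minimal sketch in plain PyTorch (no ModelOpt or DeepCompressor APIs; the helper names and the group size are illustrative only) of the difference between symmetric and asymmetric int4 quantization of a single weight group:

```python
# Illustrative comparison of symmetric vs. asymmetric int4 quantization.
import torch

def quantize_symmetric_int4(w: torch.Tensor):
    # Signed int4 range [-8, 7]; the zero-point is implicitly 0.
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale  # dequantize with q * scale

def quantize_asymmetric_int4(w: torch.Tensor):
    # Unsigned int4 range [0, 15] with an explicit zero-point, so a
    # skewed weight distribution can still use all 16 levels.
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 15)
    return q, scale, zero_point  # dequantize with (q - zero_point) * scale

w = torch.randn(128) * 0.02 + 0.01  # one skewed weight group, e.g. group size 128
q_sym, s_sym = quantize_symmetric_int4(w)
q_asym, s_asym, zp = quantize_asymmetric_int4(w)
err_sym = (w - q_sym * s_sym).abs().mean().item()
err_asym = (w - (q_asym - zp) * s_asym).abs().mean().item()
print(f"mean abs error  symmetric: {err_sym:.6f}  asymmetric: {err_asym:.6f}")
```

With a skewed weight distribution, the asymmetric variant usually gives lower reconstruction error because the explicit zero-point lets all 16 levels cover the actual weight range.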

@ShuaiShao93
Author

Thanks! I have filed mit-han-lab/deepcompressor#38.

BTW, if we could fix these issues (#2602, #2603, #2604), we could at least use w8a8. But today we can't even use w8a8.

@ShuaiShao93
Author

I managed to build the QServe w4a8 checkpoint with g128, and now trtllm-build fails with:

Traceback (most recent call last):
  File "/opt/conda/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 383, in build_model
    return build(model, build_config)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1237, in build
    model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 988, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 311, in forward
    hidden_states = self.layers.forward(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 543, in forward
    hidden_states = layer(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 163, in forward
    attention_output = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 2793, in forward
    assert lora_layer_params is None, "lora is not supported on SmoothQuantAttention now"
AssertionError: lora is not supported on SmoothQuantAttention now
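For anyone hitting the same assertion: the SmoothQuant/QServe attention path shown in the traceback is selected from the quantization section of the converted checkpoint's config.json, and it currently rejects LoRA inputs, so the engine apparently has to be built without LoRA enabled. Below is a minimal sketch for double-checking what the checkpoint declares before running trtllm-build (the directory name is hypothetical and the field names are assumptions based on recent TensorRT-LLM checkpoint layouts):

```python
# Inspect which quantization algorithm a converted TensorRT-LLM checkpoint declares.
import json
from pathlib import Path

ckpt_dir = Path("llama-3.1-8b-w4a8-g128")  # hypothetical converted checkpoint dir

with open(ckpt_dir / "config.json") as f:
    cfg = json.load(f)

quant = cfg.get("quantization", {})  # field names are assumptions, see note above
print("quant_algo:          ", quant.get("quant_algo"))
print("kv_cache_quant_algo: ", quant.get("kv_cache_quant_algo"))
print("group_size:          ", quant.get("group_size"))
```

If quant_algo reports a QServe/SmoothQuant-style w4a8 mode, any LoRA-related build options will trip the assertion above, so for now the engine has to be built without LoRA.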
