[Feature Request] Better support for w4a8 quantization #2605

Open
ShuaiShao93 opened this issue Dec 20, 2024 · 4 comments
Labels: Investigating, Low Precision (Issue about lower bit quantization, including int8, int4, fp8), triaged (Issue has been triaged by maintainers)

Comments
@ShuaiShao93

Based on this doc, we have to use DeepCompressor to prepare a fake-quantized checkpoint. However, setting up that repo is a lot of extra trouble, and the tool does not seem to be well maintained, especially for newer Llama models like 3.1/3.2. At least I was not able to get it working for Llama 3.1 8B.

It would be great if we could add more native support for w4a8 quantization in TensorRT-LLM.

@nv-guomingz nv-guomingz added the Low Precision Issue about lower bit quantization, including int8, int4, fp8 label Dec 23, 2024
@nv-guomingz
Collaborator

@Barry-Delaney would you please add comments here?

@github-actions github-actions bot added triaged Issue has been triaged by maintainers Investigating labels Dec 23, 2024
@bobboli
Collaborator

bobboli commented Dec 23, 2024

Hi,

Our officially supported toolkit for quantization is ModelOpt. We have discussed this before and found that it is not trivial to land the techniques used by DeepCompressor (such as asymmetric quantization, double quantization, rotation, smoothing, etc.) into ModelOpt. At least in the near future, we need to rely on DeepCompressor.

If you run into problems quantizing new models, could you try implementing them yourself, since DeepCompressor has an abstraction layer for models like this? You could also raise an issue in the DeepCompressor repository and paste the errors in detail; the authors of DeepCompressor would be glad to answer.

Thank you!
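For readers unfamiliar with the terms above, here is a minimal sketch in plain PyTorch (no ModelOpt or DeepCompressor APIs; the helper names and the group size are illustrative only) of the difference between symmetric and asymmetric int4 quantization of a single weight group:

```python
# Illustrative comparison of symmetric vs. asymmetric int4 quantization.
import torch

def quantize_symmetric_int4(w: torch.Tensor):
    # Signed int4 range [-8, 7]; the zero-point is implicitly 0.
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale  # dequantize with q * scale

def quantize_asymmetric_int4(w: torch.Tensor):
    # Unsigned int4 range [0, 15] with an explicit zero-point, so a
    # skewed weight distribution can still use all 16 levels.
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / 15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 15)
    return q, scale, zero_point  # dequantize with (q - zero_point) * scale

w = torch.randn(128) * 0.02 + 0.01  # one skewed weight group, e.g. group size 128
q_sym, s_sym = quantize_symmetric_int4(w)
q_asym, s_asym, zp = quantize_asymmetric_int4(w)
err_sym = (w - q_sym * s_sym).abs().mean().item()
err_asym = (w - (q_asym - zp) * s_asym).abs().mean().item()
print(f"mean abs error  symmetric: {err_sym:.6f}  asymmetric: {err_asym:.6f}")
```

With a skewed weight distribution, the asymmetric variant usually gives lower reconstruction error because the explicit zero-point lets all 16 levels cover the actual weight range.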

@ShuaiShao93
Author

Thanks! I have filed mit-han-lab/deepcompressor#38.

BTW, if we could fix these issues (#2602, #2603, #2604), we could at least use w8a8. But today we can't even use w8a8.

@ShuaiShao93
Author

I managed to build the QServe w4a8 checkpoint with g128, and now trtllm-build fails with:

Traceback (most recent call last):
  File "/opt/conda/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 627, in main
    parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 425, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 390, in build_and_save
    engine = build_model(build_config,
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 383, in build_model
    return build(model, build_config)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 1237, in build
    model(**inputs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 988, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 311, in forward
    hidden_states = self.layers.forward(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 543, in forward
    hidden_states = layer(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/models/llama/model.py", line 163, in forward
    attention_output = self.attention(
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/module.py", line 52, in __call__
    output = self.forward(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorrt_llm/quantization/layers.py", line 2793, in forward
    assert lora_layer_params is None, "lora is not supported on SmoothQuantAttention now"
AssertionError: lora is not supported on SmoothQuantAttention now
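For anyone hitting the same assertion: the SmoothQuant/QServe attention path shown in the traceback is selected from the quantization section of the converted checkpoint's config.json, and it currently rejects LoRA inputs, so the engine apparently has to be built without LoRA enabled. Below is a minimal sketch for double-checking what the checkpoint declares before running trtllm-build (the directory name is hypothetical and the field names are assumptions based on recent TensorRT-LLM checkpoint layouts):

```python
# Inspect which quantization algorithm a converted TensorRT-LLM checkpoint declares.
import json
from pathlib import Path

ckpt_dir = Path("llama-3.1-8b-w4a8-g128")  # hypothetical converted checkpoint dir

with open(ckpt_dir / "config.json") as f:
    cfg = json.load(f)

quant = cfg.get("quantization", {})  # field names are assumptions, see note above
print("quant_algo:          ", quant.get("quant_algo"))
print("kv_cache_quant_algo: ", quant.get("kv_cache_quant_algo"))
print("group_size:          ", quant.get("group_size"))
```

If quant_algo reports a QServe/SmoothQuant-style w4a8 mode, any LoRA-related build options will trip the assertion above, so for now the engine has to be built without LoRA.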
