
[REQUEST] PerChannel Setting #707

Open
3 tasks done
Coco58323 opened this issue Dec 27, 2024 · 1 comment

Comments

@Coco58323

Problem

The current implementation does not support per-channel quantization. Would you consider adding it to the config?

Solution

Implement per-channel quantization parameters in conversion/qparams.py and optimize the GEMM computation through deferred scaling: the column-wise scaling factors are applied to the final accumulation results, minimizing arithmetic operations and improving computational efficiency.
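A minimal sketch of the deferred-scaling idea, using NumPy. The names and shapes are illustrative assumptions, not the actual `conversion/qparams.py` API: the GEMM runs on the quantized integer weights, and the per-channel scales are broadcast over the accumulator once at the end instead of being multiplied into every weight beforehand.

```python
# Hypothetical sketch of deferred per-channel scaling; names are
# illustrative, not the project's actual API.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)).astype(np.float32)   # activations
W = rng.standard_normal((8, 16)).astype(np.float32)  # weights

# Per-channel (one scale per output column) symmetric int8 quantization.
scale = np.abs(W).max(axis=0) / 127.0
Q = np.clip(np.round(W / scale), -127, 127)

# Naive: dequantize every weight first, then run the GEMM.
y_naive = X @ (Q * scale)

# Deferred: GEMM on quantized weights, then one column-wise scaling
# of the accumulator (a single multiply per output element).
y_deferred = (X @ Q) * scale

assert np.allclose(y_naive, y_deferred, atol=1e-4)
```

Both paths produce the same result; the deferred form replaces a full rescale of the weight matrix with one broadcast multiply over the output.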

Alternatives

No response

Explanation

At bit widths above 4, per-channel quantization is nearly lossless in accuracy. Inference would benefit if a per-channel setting were supported.
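The accuracy claim can be illustrated with a small NumPy experiment (assumed setup, not a benchmark from the project): when output channels have very different magnitudes, a single per-tensor scale wastes most of the quantization range on small-magnitude channels, while per-channel scales do not.

```python
# Illustrative comparison of per-tensor vs. per-channel quantization error
# at 6 bits; the weight distribution here is an assumption for the demo.
import numpy as np

rng = np.random.default_rng(1)
# Weight matrix whose output channels span very different magnitudes.
W = (rng.standard_normal((256, 8)) * np.logspace(-2, 1, 8)).astype(np.float32)

def quantize(W, scale, qmax):
    """Symmetric round-to-nearest quantize/dequantize with the given scale."""
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

qmax = 2 ** (6 - 1) - 1  # symmetric 6-bit integer range

# Per-tensor: one scale for the whole matrix.
err_tensor = np.abs(W - quantize(W, np.abs(W).max() / qmax, qmax)).mean()

# Per-channel: one scale per output column.
err_channel = np.abs(W - quantize(W, np.abs(W).max(axis=0) / qmax, qmax)).mean()

assert err_channel < err_tensor
```

With skewed channel magnitudes the mean absolute error of the per-channel variant is substantially lower, which is the effect the request is pointing at.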

Examples

No response

Additional context

No response

Acknowledgements

  • I have looked for similar requests before submitting this one.
  • I understand that the developers have lives and my issue will be answered when possible.
  • I understand the developers of this program are human, and I will make my requests politely.
@turboderp
Member

Could you elaborate? EXL2 already has one FP16 scale per output channel, as well as a 4-bit scale per group of weights.
