Spec quantization #588
Similarly to the discussion in #8, the current plan is to wait until: 1) we finish speccing statically-shaped StableHLO ops, 2) we formalize syntax and maybe even evaluation of StableHLO programs (#484). Once that's done, I'm planning to provide a delta to this formalism that adds support for quantization. I think this won't be too hard and will result in a higher-quality design, because we'll be forced to explore more details.
Earlier today, concluding the Q3/Q4 speccing marathon, we finished speccing HLO semantics for the StableHLO ops. This was a huge effort that involved writing 93 specs, including digging deep into the involved semantics of ops like batch_norm_grad, convolution, dot_general and more. Congratulations to everyone who contributed to this important milestone! The idea of this project was to create a baseline from which the StableHLO opset will evolve in the future. Our immediate next steps will be writing a dynamism RFC (#8) and speccing quantization (#588) on top of this baseline. This speccing marathon has also uncovered a lot of future work - both in cleaning up the opset and improving the implementation to fully conform to the spec. This is something that we're aiming to address in the next year.
See #1149 for a proposal for how to spec quantization in StableHLO in the context of alignment with TOSA.
https://github.com/subhankarshah/stablehlo/blob/spec-quantization/docs/spec.md
[Action Item]: Verify the element-type of return value in
@subhankarshah when will the UniformQuantizeOp be specced? I am wondering which rounding modes will be supported in that op.
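To illustrate why the choice of rounding mode matters: when a scaled value lands exactly halfway between two integers, different rounding modes disagree on the quantized result. A minimal sketch (illustrative only, not from the StableHLO spec; the function names here are made up) comparing round-half-to-even with round-half-away-from-zero:

```python
# Illustrative comparison of two common rounding modes used in quantization.
# Neither is claimed to be what StableHLO will specify.
import math

def round_half_to_even(x):
    """Python's built-in round() implements round-half-to-even."""
    return round(x)

def round_half_away_from_zero(x):
    """Ties are rounded away from zero instead."""
    return math.floor(x + 0.5) if x >= 0 else math.ceil(x - 0.5)

# Quantizing the real value 1.25 with scale 0.5 yields 1.25 / 0.5 = 2.5,
# exactly halfway between 2 and 3, so the two modes produce different
# quantized integers.
```

With scale 0.5 and the real value 1.25, round-half-to-even quantizes to 2 while round-half-away-from-zero quantizes to 3, which is why the spec needs to pin this down.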
Hi @mahmoud-abuzaina Thanks for your interest!
The StableHLO dialect currently supports quantization via: 1) supporting `quant.uniform` element types, 2) having dedicated ops like `uniform_quantize` / `uniform_dequantize`, and 3) allowing regular ops like `add` / `convolution` to take quantized tensors. This support was inherited from MHLO when StableHLO was bootstrapped, and the MHLO support was motivated by mobile use cases and inherited from TFLite.

As pointed out in #1149, the StableHLO specification doesn't support quantization at the moment, and this is an important gap that we would like to fix before StableHLO v1.0 (see #588). To continue the discussion started in #1149 and to make progress towards v1.0, this pull request: A) adds QuantizedType to the StableHLO specification, modeled after the [TFLite quantization spec](https://www.tensorflow.org/lite/performance/quantization_spec), and B) proposes semantics for quantized `add`, to start a conversation about the applications of QuantizedType and the semantics of quantized ops.

The TFLite quantization spec doesn't cover everything. It specs constraints on types (which we captured accordingly in this pull request), but it doesn't go into describing the semantics of quantized ops. As a result, the proposed semantics for quantized `add` is intentionally naive compared with the much more involved implementations in the TensorFlow repository, e.g.:
* [tfl.add](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/add.cc)
* [tf.UniformQuantizedAdd](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/uniform_quant_ops/uniform_quantized_add_op.cc)

Update: After community discussion, we removed the spec for quantized `add`, leaving that for future work, since further alignment is required.

---------

Co-authored-by: Eugene Burmako <[email protected]>
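The "intentionally naive" semantics for quantized `add` described above can be sketched as: dequantize both operands to real numbers, add in floating point, then requantize the result. The following is a minimal illustration under assumed int8 storage, with made-up function names; it is not the spec's definition and ignores the fixed-point tricks used by the real TFLite/TF kernels linked above.

```python
# Hypothetical sketch of naive quantized add: dequantize -> add -> quantize.
# All names and parameters here are illustrative, not part of the StableHLO spec.

def dequantize(q, scale, zero_point):
    """Map a quantized integer to its real-number approximation."""
    return (q - zero_point) * scale

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real number to the nearest representable quantized integer,
    clamping to the assumed int8 storage range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def quantized_add(lhs, rhs, lhs_params, rhs_params, out_params):
    """Naive per-element quantized add: compute in float, requantize.
    Each *_params is a (scale, zero_point) pair."""
    result = dequantize(lhs, *lhs_params) + dequantize(rhs, *rhs_params)
    return quantize(result, *out_params)
```

For example, with scale 0.5 and zero point 0 everywhere, adding quantized values 10 and 20 (real values 5.0 and 10.0) requantizes 15.0 back to 30; results outside the storage range saturate at 127.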
Hello, I'm interested in implementing an e2e example for Llama and related models utilizing a StableHLO quantized matmul that should meet or exceed the performance (ppl and tokens/ms) of llama.cpp on CPU. Hopefully we can lower to a BLAS library.
Thanks @jon-chuang for sharing your interest; I am super excited to explore this together. A few clarifying questions:
cc @GleasonK |
The answer should be anything that can be lowered to HLO, including PyTorch, ONNX, etc. The Llama weights are, I think, originally PyTorch/Hugging Face.
It's unclear at the moment. For instance, to my understanding int4 quantization is not really meaningful on GPU due to the lack of int4 matmul units and the high overhead.
Thanks @jon-chuang for the clarification. Let me get back to you on this.
Hi @jon-chuang. On a side note: while exploring this, I found a few interesting and relevant discussions in discord#jax which you might be interested in.
Some of the StableHLO ops do not have support for quantized types in their tablegen specification, which prohibits writing quantized StableHLO programs using those ops. This PR adds the missing support for the following ops. Also, I believe the ongoing specification [work](#588) should not deviate much from the changes proposed here.
```
stablehlo.atan2
stablehlo.divide
stablehlo.power
stablehlo.remainder
stablehlo.subtract
stablehlo.abs
stablehlo.cbrt
stablehlo.cosine
stablehlo.exponential
stablehlo.exponential_minus_one
stablehlo.log
stablehlo.log_plus_one
stablehlo.logistic
stablehlo.negate
stablehlo.rsqrt
stablehlo.sign
stablehlo.sine
stablehlo.sqrt
stablehlo.tanh
stablehlo.cholesky
stablehlo.triangular_solve
```
Other than these ops, we have `fft`, `rng`, and `rng_bit_generator` (or something else which I might be missing) as potential candidates for this support. I propose that we add the support after speccing those ops, since adding it might need some non-trivial discussion.
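One way to see why the elementwise ops listed above extend naturally to quantized tensors is that each can be modeled as dequantize, then apply the float op, then requantize. The sketch below (illustrative only; the helper name and parameters are made up, and this is not how the StableHLO implementation works) lifts an arbitrary float elementwise function to quantized int8 values:

```python
# Hedged illustration: lifting a float elementwise function to quantized
# values via dequantize -> float op -> requantize. Not the StableHLO
# implementation; names here are invented for this sketch.
import math

def lift_elementwise(fn, scale, zero_point, qmin=-128, qmax=127):
    """Return a quantized version of the float elementwise function `fn`."""
    def quantized_fn(q_values):
        out = []
        for q in q_values:
            real = (q - zero_point) * scale            # dequantize
            result = fn(real)                          # float semantics
            requant = round(result / scale) + zero_point
            out.append(max(qmin, min(qmax, requant)))  # clamp to storage type
        return out
    return quantized_fn

# Quantized analogues of two of the ops listed above (scale/zero_point
# values are arbitrary examples).
quantized_abs = lift_elementwise(abs, scale=0.5, zero_point=0)
quantized_sqrt = lift_elementwise(math.sqrt, scale=0.5, zero_point=0)
```

With scale 0.5 and zero point 0, `quantized_abs([-10, 4])` dequantizes to [-5.0, 2.0], applies `abs`, and requantizes to [10, 4].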
Hello All
We are planning to open separate tickets for (1) and (2). Regarding (3), we have some ongoing work on exporting quantized PyTorch models to StableHLO ([ref](https://github.com/pytorch/xla/pull/5763)). We will be happy to understand/address any specific quantization specification in separate tickets.
This will involve documenting: 1) a representation for quantized tensors, 2) UniformQuantizeOp, 3) UniformDequantizeOp, 4) which of the existing ops can take quantized tensors and how their semantics change as a result.
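For items (2) and (3), a rough sketch of what per-tensor uniform quantize/dequantize could mean is shown below. This is illustrative only, not the official spec: it assumes an i8 storage type and the affine mapping real ≈ scale × (quantized − zero_point) used by the TFLite quantization scheme referenced in this thread.

```python
# Illustrative per-tensor uniform_quantize / uniform_dequantize semantics.
# Assumed i8 storage type; parameter names are this sketch's, not the spec's.

STORAGE_MIN, STORAGE_MAX = -128, 127  # i8 storage range

def uniform_quantize(values, scale, zero_point):
    """Elementwise: scale, round to nearest integer, clamp to storage range."""
    return [max(STORAGE_MIN, min(STORAGE_MAX, round(v / scale) + zero_point))
            for v in values]

def uniform_dequantize(values, scale, zero_point):
    """Elementwise inverse of uniform_quantize (up to rounding error)."""
    return [scale * (q - zero_point) for q in values]
```

For example, with scale 0.5 and zero point 1, the real values [1.0, -2.0] quantize to [3, -3] and dequantize back exactly; values that overflow the storage range saturate at 127.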