Spec quantization #588

Closed
burmako opened this issue Nov 24, 2022 · 13 comments
@burmako
Contributor

burmako commented Nov 24, 2022

This will involve documenting: 1) a representation for quantized tensors, 2) UniformQuantizeOp, 3) UniformDequantizeOp, 4) which of the existing ops can take quantized tensors and how their semantics change as a result.
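
To make the first three items concrete, here is a minimal sketch of a uniformly quantized tensor and the quantize/dequantize pair. It assumes the usual affine mapping `real = (code - zero_point) * scale` with per-tensor parameters and int8 storage. The names `QuantParams`, `uniform_quantize`, and `uniform_dequantize` are hypothetical Python helpers for illustration, not the StableHLO ops themselves, and the rounding mode (round-half-to-even) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical per-tensor quantization parameters (not a StableHLO API)."""
    scale: float             # real-valued step between adjacent integer codes
    zero_point: int          # integer code that represents real 0.0
    storage_min: int = -128  # assumed int8 storage range
    storage_max: int = 127


def uniform_quantize(values: List[float], p: QuantParams) -> List[int]:
    """Map reals to integer codes: q = clamp(round(x / scale) + zero_point)."""
    out = []
    for x in values:
        # Python's round() is round-half-to-even; the rounding mode is an assumption.
        q = round(x / p.scale) + p.zero_point
        out.append(max(p.storage_min, min(p.storage_max, q)))
    return out


def uniform_dequantize(codes: List[int], p: QuantParams) -> List[float]:
    """Map integer codes back to reals: x = (q - zero_point) * scale."""
    return [(q - p.zero_point) * p.scale for q in codes]


params = QuantParams(scale=0.5, zero_point=10)
codes = uniform_quantize([-1.0, 0.0, 0.75, 100.0], params)
restored = uniform_dequantize(codes, params)
print(codes)     # [8, 10, 12, 127]
print(restored)  # [-1.0, 0.0, 1.0, 58.5]
```

Note that the round trip is lossy: 0.75 comes back as 1.0 because of rounding, and 100.0 comes back as 58.5 because the code saturates at the storage maximum.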

@burmako burmako added the Spec label Nov 24, 2022
@burmako burmako self-assigned this Nov 24, 2022
@burmako
Contributor Author

burmako commented Nov 24, 2022

Similarly to the discussion in #8, the current plan is to wait until: 1) we finish speccing statically-shaped StableHLO ops, 2) we formalize syntax and maybe even evaluation of StableHLO programs (#484). Once that's done, I'm planning to provide a delta to this formalism that adds support for quantization. I think this won't be too hard and will result in a higher-quality design, because we'll be forced to explore more details.

burmako pushed a commit that referenced this issue Dec 14, 2022
Earlier today, in conclusion of the Q3/Q4 speccing marathon, we have
finished speccing HLO semantics for the StableHLO ops.

This was a huge effort that involved writing 93 specs, including digging
deep into involved semantics of ops like batch_norm_grad, convolution,
dot_general and more. Congratulations to everyone who contributed to
this important milestone!

The idea of this project was to create a baseline from which the
StableHLO opset will evolve in the future. Our immediate next steps will
be writing a dynamism RFC (#8) and speccing quantization (#588) on top
of this baseline.

Also, this speccing marathon has uncovered a lot of future work - both
in cleaning up the opset and improving the implementation to fully
conform to the spec. This is something that we're aiming to address in
the next year.
@burmako burmako assigned subhankarshah and unassigned burmako Jan 27, 2023
@burmako
Contributor Author

burmako commented Feb 10, 2023

See #1149 for a proposal for how to spec quantization in StableHLO in the context of alignment with TOSA.

@subhankarshah
Member

@ghpvnist
Member

[Action Item]: Verify the element-type of return value in inferConvolutionOp in TypeInference.cpp as noted in #1314 (comment)

@mahmoud-abuzaina

@subhankarshah when will the UniformQuantizeOp be specced? I am wondering what rounding modes will be supported in that op?

@sdasgup3
Member

Hi @mahmoud-abuzaina Thanks for your interest!
We are currently exploring options around speccing ops with quantized types: determining constraints on quantization parameters and types, including the rounding mode involved during quantization. You can expect relevant PRs to start pouring in for review in Q1'23 and early Q2'23.
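
To illustrate why the rounding mode matters for quantization, the sketch below compares two common tie-breaking rules using Python's standard `decimal` module. The helper `quantize_with_rounding` is hypothetical and only shows the effect; it says nothing about which mode StableHLO ends up specifying.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def quantize_with_rounding(x: float, scale: float, rounding: str) -> int:
    """Hypothetical helper: round x / scale to an integer with the given mode."""
    ratio = Decimal(x) / Decimal(scale)
    return int(ratio.quantize(Decimal(1), rounding=rounding))


# 1.25 / 0.5 == 2.5 exactly, so the tie-breaking rule decides the result.
half_even = [quantize_with_rounding(v, 0.5, ROUND_HALF_EVEN) for v in (1.25, -1.25)]
half_away = [quantize_with_rounding(v, 0.5, ROUND_HALF_UP) for v in (1.25, -1.25)]
print(half_even)  # [2, -2]
print(half_away)  # [3, -3]
```

In the `decimal` module, `ROUND_HALF_UP` breaks ties away from zero, so the two modes disagree by one integer code on exact ties.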

burmako pushed a commit that referenced this issue Apr 14, 2023
StableHLO dialect currently supports quantization via:
  1) Supporting `quant.uniform` element types.
  2) Having dedicated ops like `uniform_quantize` / `uniform_dequantize`.
  3) Allowing regular ops like `add` / `convolution` to take quantized
tensors.

This support was inherited from MHLO when StableHLO was bootstrapped,
and MHLO support was motivated by mobile use cases and inherited from
TFLite.

As pointed out in #1149, StableHLO specification doesn't support
quantization at the moment, and this is an important gap that we would 
like to fix before StableHLO v1.0 (see #588).

To continue the discussion started in #1149 and to make progress towards
v1.0, this pull request:
  A) Adds QuantizedType to the StableHLO specification, modelled after
[TFLite quantization
spec](https://www.tensorflow.org/lite/performance/quantization_spec).
  B) To start a conversation about the applications of QuantizedType and
the semantics of quantized ops, proposes semantics for quantized `add`.

TFLite quantization spec doesn't cover everything. It specs constraints
on types (which we captured accordingly in this pull request), but it
doesn't go into describing semantics of quantized ops.

As a result, the proposed semantics for quantized `add` is intentionally
naive, as compared with the much more involved implementations in the
TensorFlow repository, e.g.:
  * [tfl.add](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/add.cc).
  * [tf.UniformQuantizedAdd](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/uniform_quant_ops/uniform_quantized_add_op.cc).

upd: After community discussion, we removed the spec for quantized
`add` leaving that for future work, since further alignment is required.

---------

Co-authored-by: Eugene Burmako <[email protected]>
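
The commit above does not spell out the "intentionally naive" semantics it proposed for quantized `add`. One naive reading, offered here only as an assumption, is to dequantize both operands, add them as floats, and requantize into the result type. The sketch below uses hypothetical helpers and `(scale, zero_point)` tuples, assumes int8 storage and round-half-to-even, and is not the semantics from the pull request or from the TensorFlow kernels linked there.

```python
from typing import List, Tuple

QParams = Tuple[float, int]  # (scale, zero_point); hypothetical representation


def dequantize(codes: List[int], qp: QParams) -> List[float]:
    scale, zero_point = qp
    return [(q - zero_point) * scale for q in codes]


def quantize(values: List[float], qp: QParams, lo: int = -128, hi: int = 127) -> List[int]:
    scale, zero_point = qp
    # round() is round-half-to-even; the int8 bounds are an assumption.
    return [max(lo, min(hi, round(x / scale) + zero_point)) for x in values]


def naive_quantized_add(lhs, lhs_qp, rhs, rhs_qp, out_qp):
    """Dequantize both operands, add as floats, requantize into the result type."""
    a = dequantize(lhs, lhs_qp)
    b = dequantize(rhs, rhs_qp)
    return quantize([x + y for x, y in zip(a, b)], out_qp)


result = naive_quantized_add(
    lhs=[4, 8], lhs_qp=(0.5, 0),     # reals 2.0, 4.0
    rhs=[10, 30], rhs_qp=(0.25, 2),  # reals 2.0, 7.0
    out_qp=(1.0, -3),
)
print(result)  # [1, 8]
```

Going through floating point keeps the sketch short and lets each operand carry its own scale and zero point; production kernels typically avoid the float round trip by rescaling in integer arithmetic.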
@jon-chuang

jon-chuang commented May 16, 2023

Hello, I'm interested in implementing an e2e example for llama and related models utilizing a stableHLO quantized matmul that should meet or exceed the performance (ppl and tokens/ms) of llama.cpp on CPU. Hopefully we can lower to a BLAS library.

@sdasgup3
Member

sdasgup3 commented May 16, 2023

Thanks @jon-chuang for sharing your interest; I am super excited to explore this together. A few clarifying questions:

  1. Do you expect the source llama model to be in C++, or is it OK for the model to be expressed using other frameworks like PyTorch?
  2. Are you also interested in exploring the performance numbers in platforms other than CPU?

cc @GleasonK

@jon-chuang

> Do you expect the source llama model to be in C++ or it is OK for the model to be expressed using other frameworks like PyTorch?

The answer should be anything that can be lowered to HLO, including PyTorch, ONNX, etc. I think the Llama weights are originally in PyTorch/Hugging Face format.

> Are you also interested in exploring the performance numbers in platforms other than CPU?

It's unclear at the moment. For instance, to my understanding, int4 quantization is not really meaningful on GPU due to the lack of int4 matmul units and high overhead.

@sdasgup3
Member

sdasgup3 commented May 16, 2023

Thanks @jon-chuang, for the clarification. Let me get back on this.

@sdasgup3
Member

Hi @jon-chuang

Overall, an e2e example for llama and related models utilizing a StableHLO quantized matmul is very exciting to have, and we have actively started gathering what's needed in StableHLO to represent quantization in cutting-edge LLMs (#1491). Feel free to keep an eye on future updates.

On a side note: while exploring this, I found a few interesting and relevant discussions in discord#jax that you might be interested in.

sdasgup3 added a commit that referenced this issue Jun 29, 2023
Some of the StableHLO ops do not support quantized types in their
TableGen specification, which prevents writing quantized StableHLO
programs using those ops. This PR adds the missing support for the
following ops. I also believe that the ongoing specification
[work](#588) should not deviate much from the changes proposed here.

```
stablehlo.atan2
stablehlo.divide
stablehlo.power
stablehlo.remainder
stablehlo.subtract

stablehlo.abs
stablehlo.cbrt
stablehlo.cosine
stablehlo.exponential
stablehlo.exponential_minus_one
stablehlo.log
stablehlo.log_plus_one
stablehlo.logistic
stablehlo.negate
stablehlo.rsqrt
stablehlo.sign
stablehlo.sine
stablehlo.sqrt
stablehlo.tanh

stablehlo.cholesky
stablehlo.triangular_solve
```

Besides these ops, `fft`, `rng`, and `rng_bit_generator` (and possibly
others that I might be missing) could be potential candidates for
quantized support. I propose that we add that support after adding the
specification of those ops, since it might need some non-trivial
discussion.
@sdasgup3
Member

sdasgup3 commented Dec 6, 2023

Hello All
With the reduction-based operations, we are planning to close the current issue on quantization specification. The remaining items are:

  1. [Action Item]: Verify the element-type of return value in inferConvolutionOp in TypeInference.cpp as noted in Add interpreter for ConvolutionOp #1314 (comment)
  2. See RFC for aligning StableHLO and TOSA arithmetic #1149 for a proposal for how to spec quantization in StableHLO in the context of alignment with TOSA.
  3. e2e example for llama and related models utilizing a stableHLO

We are planning to open separate tickets for (1) and (2). Regarding (3), we have some ongoing work on exporting quantized PyTorch models to StableHLO ([ref](https://github.com/pytorch/xla/pull/5763)). We will be happy to understand and address any specific quantization specification needs in separate tickets.

@sdasgup3
Member

sdasgup3 commented Jan 2, 2024

We have added #1896 and #1898 to track (1) and (2), respectively. With that, we are closing the current ticket.

Projects
Status: Done

7 participants