## RFC: Align StableHLO and TOSA arithmetic

| Status        | Proposed |
| :------------ | :------- |
| **RFC #**     |          |
| **Author(s)** | Eric Kunze ([email protected]), Dominic Symes ([email protected]), Tom Cooksey ([email protected]) |
| **Updated**   |          |

## Objective

Align the arithmetic operations of StableHLO and TOSA.

## Proposal

* Define the arithmetic operations of StableHLO to enable implementation via TOSA operators.
* Enable implementation of fully quantized networks on systems without floating-point hardware, such as the [Arm Ethos-U55](https://developer.arm.com/documentation/102420/0200/Programmers-model/Operators-and-performance/Operations).

## Background

Both StableHLO and TOSA provide a set of operations for processing ML models.
They operate at adjacent points in the compilation stack, providing stability points for implementations.
StableHLO sits closer to the framework and compiler, while TOSA sits closer to hardware implementations and smaller systems.

Fully quantized networks have converted all of their operators into integer-based operations.
Integer operations generally consume less power and are often faster than the equivalent floating-point operations.
The silicon area required to implement integer operations is also significantly smaller, leading some accelerators to omit floating-point circuitry completely, particularly in the mobile/embedded market.
Weights and activations quantized down to 8 or 16 bits also require less memory traffic due to their smaller size.
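
For reference, the uniform (affine) quantization scheme used throughout this RFC represents a real value by an integer together with a scale and zero point. A minimal Python sketch, reusing the scale and zero point from the add example later in this document (the function and the sample value 39 are illustrative, not part of either specification):

```
def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Uniform quantization: integer q represents scale * (q - zero_point)."""
    return scale * (q - zero_point)

# An i8 value of 39 with scale 0.025 and zero point -1
# represents 0.025 * (39 - (-1)) = 1.0.
print(dequantize(39, 0.025, -1))
```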

Fully integer networks also guarantee reproducibility across completely independent implementations.
They can be tested bit accurately, and there is no need for fuzz testing to catch possible errors.

### TOSA

Tensor Operator Set Architecture ([TOSA](https://mlplatform.org/tosa)) provides a set of whole-tensor operations commonly employed by deep neural networks.
The intent of TOSA is to enable a variety of implementations running on a diverse range of processors, with the results at the TOSA level consistent across those implementations.
This consistency extends to systems such as microcontrollers, where the cost of a floating-point multiplier is a significant burden in area and power.

TOSA requires that results are bit-accurately defined for integer operations.
This requirement is in place to ensure portability across implementations.

TOSA provides a conformance test suite and reference model to verify that implementations match the expected behavior.

A TOSA dialect in the MLIR repository represents the specification and evolves alongside it.

TOSA operations do not accept quantized tensors; quantization has been moved into explicit RESCALE operations.
RESCALE is built from integer multiply, add, and shift operations, which are widely available.
Explicit zero points are applied where needed for asymmetric quantization.
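
A minimal Python sketch of the integer-only arithmetic a RESCALE-style operation performs (the rounding choice and the int8 saturation here are illustrative assumptions, not the normative TOSA definition):

```
def rescale(value: int, multiplier: int, shift: int,
            input_zp: int = 0, output_zp: int = 0) -> int:
    """Illustrative rescale: ((value - input_zp) * multiplier) >> shift, rounded."""
    centered = value - input_zp                    # remove the input zero point
    rounding = 1 << (shift - 1)                    # round-half-up term (assumes shift >= 1)
    scaled = (centered * multiplier + rounding) >> shift
    result = scaled + output_zp                    # re-apply the output zero point
    return max(-128, min(127, result))             # saturate to int8 for this sketch
```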

### Example TOSA expanded quantized operation

This is an example of an expanded quantized add operation.
Before performing the integer addition, both inputs must be brought to a common scale.
Multiple options are available for the multiplier and shift.
This allows the scaling to be customized to match the behavior of the original framework.

**Framework operator**
```
%framework_sum = framework.add %a, %b :
    (tensor<2x2x!quant.uniform<i8:f32, 0.025:-1>>, tensor<2x2x!quant.uniform<i8:f32, 0.075:-1>>)
    -> tensor<2x2x!quant.uniform<i8:f32, 0.15:-1>>
```

**TOSA quantized implementation**
```
%scaled_a = tosa.rescale %a
    {input_zp = -1 : i32, multiplier = [1431655765 : i32], shift = [13 : i32]} :
    (tensor<2x2xi8>) -> tensor<2x2xi32>
%scaled_b = tosa.rescale %b
    {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [11 : i32]} :
    (tensor<2x2xi8>) -> tensor<2x2xi32>
%sum = tosa.add %scaled_a, %scaled_b :
    (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
%scaled_sum = tosa.rescale %sum
    {output_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [50 : i32]} :
    (tensor<2x2xi32>) -> tensor<2x2xi8>
```

In this instance, the quantization has been removed from the input and output tensors.
The scale and zero point from the quantization parameters have been moved into the multiplier, shift, and zero-point attributes of the RESCALE operations.
The TOSA RESCALE operation does not use the quantization parameters, only the multiplier and shift attributes that are part of the operation.
Each step along this path is well defined in terms of integer arithmetic, so all implementations will return the same result.
If a floating-point operation were used for rescaling, it would open up the possibility of bit errors, leading to varying results between devices and to problems for integer-only implementations.
With TOSA's guarantee, portability across implementations is much easier.
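
As an illustration of how such multiplier and shift attributes can be derived, the sketch below converts a floating-point scale ratio into a fixed-point multiplier in the gemmlowp style; the normalization and the exact shift encoding are assumptions and may differ from TOSA's definition:

```
import math

def quantize_scale(scale: float, bits: int = 32) -> tuple[int, int]:
    """Approximate scale as multiplier * 2**-shift with a normalized
    (bits-1)-bit multiplier."""
    mantissa, exponent = math.frexp(scale)   # scale = mantissa * 2**exponent
    multiplier = round(mantissa * (1 << (bits - 1)))
    shift = (bits - 1) - exponent
    if multiplier == (1 << (bits - 1)):      # rounding overflowed the mantissa
        multiplier //= 2
        shift -= 1
    return multiplier, shift

# Rescaling %a from input scale 0.025 to output scale 0.15:
print(quantize_scale(0.025 / 0.15))          # (1431655765, 33)
```

Note that this recovers the multiplier used for `%a` above; the shift value depends on how an implementation encodes the fixed-point position.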

### Quantization in StableHLO

StableHLO accepts quantized tensors for operations, but the behavior is not defined in the existing specification.
This RFC references the appendix of the FP8 in XLA [RFC](https://github.com/openxla/xla/discussions/22) for an understanding of how quantization is implemented in StableHLO.

As described in the FP8 RFC, the quantization itself does not appear to differ significantly from what TOSA does; much of the work required to align is making the underlying behavior explicit.
Once the behavior is explicit, comparisons to TOSA quantization behavior can be made, and it should be possible to align the two.
One key obstacle in the current StableHLO design is that the quantization scale is kept as a floating-point number.
This adds ambiguity for quantized values: it makes the operation impossible to describe as a fully integer-based computation that could run on an integer-only processor, and it prevents bit-accurate checking of the results.

## Design Proposal

To improve the experience for developers using StableHLO to express fully quantized models, align the arithmetic operations with TOSA and its handling of quantized operations, so that networks compiled using StableHLO return the same results across implementations.

### Make quantized StableHLO operations explicit

Define which operations allow quantized input/output tensors.
Do all operations which accept integer tensors also accept quantized tensors?

Convert the quantization parameters to an integer multiplier and shift.

Add a new StableHLO operator, rescale, which performs the same function as the TOSA RESCALE operator.

Add a pass to replace the quantized tensors with integer tensors.
A TOSA pass exists for this purpose and could be adapted to perform the same transformation on StableHLO operators.

For each operator that accepts a quantized tensor, define the behavior using methods which align with TOSA operations.
Define any intermediate data sizes that would affect the results (e.g., accumulator size for convolution), as illustrated in the sketch below.
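
A small sketch of why accumulator size must be pinned down (NumPy is used purely for illustration; the sizes are assumptions, not values from either specification):

```
import numpy as np

# A dot product of 256 int8 values can reach 127 * 127 * 256 = 4_129_024,
# which overflows an int16 accumulator but fits comfortably in int32.
a = np.full(256, 127, dtype=np.int8)
b = np.full(256, 127, dtype=np.int8)

acc32 = int(np.sum(a.astype(np.int32) * b.astype(np.int32)))
print(acc32)   # 4129024: implementations accumulating at different widths would disagree
```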

## Open Issues

**TOSA does not currently define FP8 operations.**

As described in the StableHLO FP8 RFC, FP8 types may be treatable as uniform quantized tensors within the dialects.
This can serve as another alignment point but is not addressed in this RFC.

**Dynamic quantization scaling during training.**

As also described in the FP8 RFC, dynamic scaling is a desirable feature for training.
Neither StableHLO nor TOSA currently supports dynamic scaling.
Both dialects would need updating, although it does not appear that the operators in the specifications would need to change significantly.

**Floating-point compound operator accuracy requirements.**

The next draft of the TOSA specification includes a proposed error bound for compound operations such as convolution.
This could be another point for collaboration.

## References

1. [TOSA Specification](https://www.mlplatform.org/tosa/tosa_spec.html)
2. [StableHLO Specification](https://github.com/openxla/stablehlo/blob/main/docs/spec.md)
3. [StableHLO FP8 RFC](https://github.com/openxla/xla/discussions/22)
4. [Arm Ethos-U55 NPU](https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55)