Currently, the key matrix is transposed first before being quantized. To simplify the streamlining and detection of the operator pattern in FINN, I would like to have the transpose directly in front of the MatMul operation - currently there is a Quant node (and later on a MultiThreshold) in between. Would it generally be OK to switch the order of these operations? Or is there more reasoning behind the current order that I do not see? Maybe it yields better quantization statistics (if at all, this should only apply to channel-/group-wise quantization)?
For more context on the effort of streamlining the Brevitas exported QuantMultiheadAttention, please see Xilinx/finn#878
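To illustrate the question, here is a minimal NumPy sketch (made-up shapes and a simple per-tensor fake-quant, not the actual Brevitas export) showing that for per-tensor quantization the two orders are numerically equivalent, so any difference could only come from the channel-/group-wise case:

```python
import numpy as np

def fake_quant(x, scale, zero_point=0.0):
    """Uniform affine quantize-dequantize (per-tensor)."""
    return (np.round(x / scale + zero_point) - zero_point) * scale

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))   # (seq_len, head_dim), made-up shapes
k = rng.standard_normal((8, 16))
scale = 0.05

# current export order: transpose the key, then quantize
attn_current = q @ fake_quant(k.T, scale)

# proposed order: quantize first, transpose directly in front of the MatMul
attn_proposed = q @ fake_quant(k, scale).T

# for per-tensor quantization both orders give the same result
assert np.allclose(attn_current, attn_proposed)
```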
The idea of having the quantization just before the matmul is to avoid dealing with the transposition of scale factors and zero points, especially in the case of per-channel/per-group quantization.
Although QuantTensor supports transpose even for its quantization metadata, I would first need to check that it is robust enough in these cases, so that we do not have to worry too much about transposing after quantization with different types of quantization.
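A minimal sketch of the bookkeeping this refers to, assuming a per-channel scale along the last axis (made-up shapes, not the actual QuantTensor implementation): if the transpose happens after quantization, the scale and zero point have to be transposed along with the values for broadcasting to stay correct.

```python
import numpy as np

def fake_quant(x, scale, zero_point=0.0):
    # scale/zero_point broadcast against x, so their layout must follow x's layout
    return (np.round(x / scale + zero_point) - zero_point) * scale

rng = np.random.default_rng(0)
k = rng.standard_normal((8, 16))       # (seq_len, head_dim), made-up shapes
scale = np.full((1, 16), 0.05)         # per-channel scale along the last axis

# quantize, then transpose: the scale layout matches k, no extra bookkeeping
k_qt = fake_quant(k, scale).T

# transpose, then quantize: scale (and zero point) must be transposed as well,
# otherwise broadcasting applies the wrong scale to each channel
k_tq = fake_quant(k.T, scale.T)

assert np.allclose(k_qt, k_tq)
```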
See here for the condition on location of the transpose operation I currently use for detecting the pattern: https://github.com/iksnagreb/attention-dummy/blob/infer-op/infer.py#L124
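For reference, a structural condition of this kind could be sketched on a plain ONNX graph roughly as follows (a simplified stand-in for illustration only, not the actual check linked above, which matches a larger pattern):

```python
import onnx

def matmul_has_direct_transpose_input(model: onnx.ModelProto,
                                      matmul: onnx.NodeProto) -> bool:
    # map each tensor name to the node that produces it
    producers = {out: node for node in model.graph.node for out in node.output}
    # true only if some input of the MatMul is fed directly by a Transpose node,
    # i.e. without a Quant/MultiThreshold node in between
    return any(
        inp in producers and producers[inp].op_type == "Transpose"
        for inp in matmul.input
    )
```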