Converting Quantised PyTorch Models #40

NathanielB123 · 2024-05-10T19:42:55Z

NathanielB123
May 10, 2024

This is somewhere between a Nobuco feature request and just general questions about how quantisation in TensorFlow/TFLite works in case anybody who sees this knows.

Essentially my problem is that I have a quantised PyTorch model (manually quantised, so I am directly calling torch.quantize_per_tensor, torch.ops.quantized.linear, torch.ops.quantized.layer_norm etc...) and I want to convert this model to TFLite for deployment.

For the non-quantised model, Nobuco does a great job (#36 was a bit funky and there were a couple operators with missing support but adding the @nobuco.converters was surprisingly intuitive). For the quantised model, I am having a harder time.

One problem is that when converting, I get a ton of errors about quint8 and qint8 dtypes not being supported. I haven't yet dug into these error messages enough to figure out if this is a sign of a fundamental problem or just an obvious case not being handled somewhere, mainly because I've been more concerned by a second issue...

Specifically, intercepting the quantised ops with @nobuco.converters works like any other built-in PyTorch operators, but I am struggling to find TensorFlow functions to replace them with. torch.quantize_per_tensor becomes tf.quantization.quantize and torch.Tensor.dequantize becomes tf.quantization.dequantize but beyond that, as far as I can tell, TensorFlow doesn't expose quantised versions of stuff like dense layers (though it does expose quantised concatenation... for some reason?). My best guess for why, looking at the APIs for TensorFlow quantisation, is that these operators don't really exist inside TensorFlow itself, only in TFLite.

BUT of course it is possible to have models in TensorFlow that "act" like they are quantised (both during training in TensorFlow and when exporting to TFLite) - this is necessary for quantisation-aware training to work! The problem is that the APIs to create these seem really opaque. Instead of creating individual quantised layers and being able to set their zero_point and scale you just call tfmot.quantization.keras.quantize_model and this does a bunch of stuff behind-the-scenes.

So, context covered, specific questions time (to anyone who thinks they might know):

Is my summary above correct? Have I just missed some obvious way to create quantised layers directly with TensorFlow?
Even if it isn't obvious, perhaps if tfmot.quantization.keras.quantize_model works under-the-hood by inserting fake quantisation layers around ordinary TensorFlow operators (and then the TFLite exporter looks for these fake quantisation layers and replaces them with actual quantised operators), then I could try to do a similar thing inside the @nobuco.converters? Does this sound realistic/feasible?
Alternatively, I've been considering trying a really janky work-around along the lines of treating each layer as it's own model, applying PTQ and then sending a tensor of the minimum and a tensor of the maximum value through, hoping that this has the effect of creating a quantised layer with my desired min/max value set appropriately. Is this insane?
Am I barking up the wrong tree completely here? Is there a different obvious way I should be looking into for deploying quantised PyTorch models? (I know technically ExecuTorch exists but it is super new and right now my model really hates torch.export)
I guess an alternate approach to exporting the quantised model directly is I could try and edit the non-quantised TFLite model, adding in the quantise/dequantise layers and setting the scales and zero-points myself. Of course, TFLite is not the easiest to edit of file formats...
And finally... any chance Nobuco could add magical support for exporting quantised PyTorch models?🥺👉👈 (if my above summary is correct, then this seems to me like it would be quite difficult, but hey I thought I'd ask just-in-case, lol)

crimson206 · 2024-05-11T08:04:22Z

crimson206
May 11, 2024

Have you ever tried to writing your custom nobuco.converter? It might be much easier than you think. When I have time, I would also try a bit. Although I am not the owner of this project, I think your question is just valid for this project. Nobuco is well structured to convert from the fundamental torch functions to the entire model recursively. One person uses only small parts of torch, but we as a group use almost all the parts of torch. If each of us implements Unimplemented nodes for our own sake, Nobuco will be powered to convert most of torch models at some point.

1 reply

NathanielB123 May 11, 2024
Author

I have written my own converters! Sorry maybe I wasn't clear in my post, I have written converters for quantize and dequantize (not convinced they are fully correct but I think they do approximately the right thing). The problem is I can't find any TensorFlow equivalent functions to be able to convert the actual quantised operations like torch.ops.quantized.linear.

Minimal code example:

import torch
from torch import nn
from torch._ops import ops
import nobuco
from nobuco import ChannelOrder, ChannelOrderingStrategy
import tensorflow as tf


class Model(nn.Module):
    def __init__(self, weight, bias):
        super().__init__()
        weight = torch.quantize_per_tensor_dynamic(weight, torch.qint8, False)
        self.packed = torch.ops.quantized.linear_prepack(weight, bias)

    def forward(self, x: torch.Tensor):
        x = torch.quantize_per_tensor(x, 1.0, 128, torch.quint8)
        x = ops.quantized.linear(x, self.packed, 1.0, 128)
        x = x.dequantize()
        return x


@nobuco.converter(torch.quantize_per_tensor)
def quantise(x: torch.Tensor, scale: float, zero: int, dtype: torch.dtype):
    assert dtype == torch.quint8

    min_q = -zero * scale
    max_q = (255 - zero) * scale

    def func(x: tf.Tensor, *args):
        return tf.quantization.quantize(x, min_q, max_q, tf.quint8)[0]

    return func


@nobuco.converter(torch.Tensor.dequantize)
def dequantise(x: torch.Tensor):
    scale, zero = x.q_scale(), x.q_zero_point()
    min_q = -zero * scale
    max_q = (255 - zero) * scale

    def func(x: tf.Tensor):
        return tf.quantization.dequantize(x, min_q, max_q)

    return func


@nobuco.converter(ops.quantized.linear)
def linear_quantized(x: torch.Tensor, packed, out_scale, out_zero):
    weight, bias = ops.quantized.linear_unpack(packed)

    def func(x: tf.Tensor):
        # TODO: How to implement this?
        pass

    return func


def convert(model: nn.Module):
    keras_model = nobuco.pytorch_to_keras(
        model,
        args=[torch.rand((1, 10))],
        kwargs={},
        inputs_channel_order=ChannelOrder.TENSORFLOW,
        outputs_channel_order=ChannelOrder.TENSORFLOW,
    )
    return keras_model


model = Model(torch.rand((10, 10)), torch.rand((10,)))
converted = convert(model)

Note all the errors about unsupported dtypes, but ALSO note that I have no clue how to implement the converter for torch.ops.quantized.linear, which is my main concern.

AlexanderLutsenko · 2024-05-13T16:52:19Z

AlexanderLutsenko
May 13, 2024
Maintainer

One problem is that when converting, I get a ton of errors about quint8 and qint8 dtypes not being supported.

I tried to add support for these dtypes lately, and that's a bigger task than I expected. For starters, quantized tensors in Tensorflow are not even tensors but namedtuples, and thus incompatible with the existing codebase. I could make it work, of course, but not sure if I should, because...

Like you said, quantized ops are nowhere to be seen in Tensorflow.

Even if it isn't obvious, perhaps if tfmot.quantization.keras.quantize_model works under-the-hood by inserting fake quantisation layers around ordinary TensorFlow operators (and then the TFLite exporter looks for these fake quantisation layers and replaces them with actual quantised operators), then I could try to do a similar thing inside the @nobuco.converters?

Yes, QuantizeWrapper looks and feels very similar to QuantWrapper in Pytorch. What you propose is feasible... if the quantization policies match.

Alternatively, I've been considering trying a really janky work-around along the lines of treating each layer as it's own model, applying PTQ and then sending a tensor of the minimum and a tensor of the maximum value through, hoping that this has the effect of creating a quantised layer with my desired min/max value set appropriately. Is this insane?

Dunno, my experience with that stuff is very limited.

Am I barking up the wrong tree completely here? Is there a different obvious way I should be looking into for deploying quantised PyTorch models?

I wouldn't have created Nobuco If I knew a better way to deploy models on the mobile and the web.

I guess an alternate approach to exporting the quantised model directly is I could try and edit the non-quantised TFLite model, adding in the quantise/dequantise layers and setting the scales and zero-points myself. Of course, TFLite is not the easiest to edit of file formats...

TFLite converter can perform post-training quantization automatically, why would do the same thing by hand?

And finally... any chance Nobuco could add magical support for exporting quantised PyTorch models?

I'd like to, I just don't have a good idea of how to approach that. Looks like there are lots of things to consider.
Meanwhile, I think it's worth a shot to ask some questions here: PINTO0309/onnx2tf.

3 replies

NathanielB123 May 14, 2024
Author

Thanks for the answers!

Good to know about the dtype situation. Any chance you could clarify what you mean by TensorFlow representing quantised tensors as namedtuples? https://www.tensorflow.org/api_docs/python/tf/quantization/quantize does return a tuple of tensors, but I thought just taking the first tensor would be fine? Unless I guess you need to keep the output_min and output_max to dequantise later - but in theory given the scales and zero points are constant by the time you are exporting from PyTorch, I don't think this should be necessary? Are there other functions which work with namedtuples I am missing?

If the dtype stuff is resolvable, then QuantizeWrapper sounds like a good potential solution. I guess the immediate puzzle is working out whether it is possible to set the scale and zero point manually (https://discuss.tensorflow.org/t/while-doing-quantization-is-it-possible-to-specify-the-scale-and-zero-point-to-tensorflow-int8-kernel/5486/5) implies it isn't (or at least wasn't, ~1 year ago) but the QuantizeConfig API does look reasonably promising...

On doing PTQ with the TFLite converter, this is actually what I started with, but for some reason it absolutely destroyed output quality (the model in a GNN, which support for in general seems super flaky so it might just be a bug). Doing PTQ in PyTorch gives much better results (and also I have the option of doing QAT if I need even more accuracy).

Finally, thanks for the pointer to onnx2tf - I do actually have exporting to a quantised onnx model working, but it being in QDQ format (where every quantised operator looks like dequantise->operator->quantise) I think would cause issues with trying to convert to TFLite naively.

AlexanderLutsenko May 14, 2024
Maintainer

Unless I guess you need to keep the output_min and output_max to dequantise later - but in theory given the scales and zero points are constant by the time you are exporting from PyTorch, I don't think this should be necessary?

Yes, I need to keep them around. One reason is that torch.Tensor.dequantize() does not accept additional inputs, so when I write its TF counterpart, I can only get the quantization parameters from the tensor itself.

If the dtype stuff is resolvable, then QuantizeWrapper sounds like a good potential solution.

As I can see it, the QuantizeWrapper path should not require to expose quantized dtypes at all, as they would be contained inside the wrapper.

On doing PTQ with the TFLite converter, this is actually what I started with, but for some reason it absolutely destroyed output quality

This mirrors my experience with vision models: complete garbage after full integer quantization.

NathanielB123 May 14, 2024
Author

My attempt at implementing dequantize was:

@nobuco.converter(torch.Tensor.dequantize)
def dequantise(x: torch.Tensor):
    scale, zero = x.q_scale(), x.q_zero_point()
    min_q = -zero * scale
    max_q = (255 - zero) * scale

    def func(x: tf.Tensor):
        return tf.quantization.dequantize(x, min_q, max_q)

    return func

(i.e. estimating the min and max from the torch tensor, and assuming those values don't change when it gets called in the tf model) - I feel like this ought to work but maybe my model of how Nobuco's converters work is wrong? Regardless I guess there are also other parts of the code where they are needed - I made a bit of an attempt today to add the types myself and I think the code for comparing outputs for checking accuracy might need the parameters as well (to convert from TensorFlow to PyTorch tensors).

Interesting to hear about QuantizeWrapper - I haven't looked that closely at how it works yet. I guess my original plan was to convert quantised PyTorch operations like torch.ops.quantized.linear to QuantizeWrapper-wrapped TensorFlow operations, which I feel like would need at least a little support for handling the PyTorch tensors that are torch.qint8/torch.quint8 (i.e. the output from the QuantizeWrapper-wrapped TensorFlow op would have to be compared against the output from the quantized PyTorch operator - for this to work maybe I should be mapping torch.qint8/torch.quint8 to tf.float32?) I guess an alternative would be to change my PyTorch model to use QuantWrapper and learn how that works...

Nice to know I'm not the only one getting terrible quality from PTQ in TensorFlow lol.

AlexanderLutsenko · 2024-05-15T14:39:31Z

AlexanderLutsenko
May 15, 2024
Maintainer

Alright, I think I'm onto something here. Got it to work for a simple quantized model.

Try it:

pip install https://github.com/AlexanderLutsenko/nobuco/archive/quantized.zip

Whether the converted model will be properly quantized by TFLite is another question.

1 reply

NathanielB123 May 16, 2024
Author

Woah nice! At least with the TFLite benchmark tool using XNNPack on x86, the fully connected layer in the exported TFLite model is still ran in float32, but just being able to export a model that produces correct results is still cool.

Surrounding operators with dequantising and quantising again feels a lot like QDQ-style quantisation in ONNX, but yeah looks like somewhere either in the converter or interpreter/delegate it isn't recognising that we would like to run the fully connected layer in int8.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Converting Quantised PyTorch Models #40

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Converting Quantised PyTorch Models #40

NathanielB123 May 10, 2024

Replies: 3 comments · 5 replies

crimson206 May 11, 2024

NathanielB123 May 11, 2024 Author

AlexanderLutsenko May 13, 2024 Maintainer

NathanielB123 May 14, 2024 Author

AlexanderLutsenko May 14, 2024 Maintainer

NathanielB123 May 14, 2024 Author

AlexanderLutsenko May 15, 2024 Maintainer

NathanielB123 May 16, 2024 Author

NathanielB123
May 10, 2024

Replies: 3 comments 5 replies

crimson206
May 11, 2024

NathanielB123 May 11, 2024
Author

AlexanderLutsenko
May 13, 2024
Maintainer

NathanielB123 May 14, 2024
Author

AlexanderLutsenko May 14, 2024
Maintainer

NathanielB123 May 14, 2024
Author

AlexanderLutsenko
May 15, 2024
Maintainer

NathanielB123 May 16, 2024
Author