
Does there have to be QuantIdentity as a first layer? #790

Closed · phixerino opened this issue Jan 5, 2024 · 8 comments

@phixerino

In a few examples (LeNet, CNV) there is a QuantIdentity to quantize the input, but in others (ResNet, ImageNet examples) there isn't. So is it beneficial or not? And what data type does the network then expect?

Btw, my aim is to export the network to FINN. Thanks!

@fabianandresgrob (Contributor) commented Jan 5, 2024

Hi @phixerino,

Thanks for your question. The QuantIdentity is merely quantizing a tensor that you put in. In other words, it calculates the quantization parameters for your input and returns a QuantTensor (if you set return_quant_tensor=True).
If you specify an input quantizer for the layer, e.g. QuantLinear(2, 4, input_quant=Int8ActPerTensorFloat, bias=False), you don't need to use the QuantIdentity layer. If you don't specify the input_quant in the layer, then you should use the QuantIdentity layer. Algorithmically, these two options do exactly the same thing (given you choose the same quantizer). We provide both options because they can make a difference when exporting the model, e.g. to ONNX format. So it really depends on your use case whether it's beneficial or not.
You can see this in more detail here or here.

Data type should be the same :)
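
For reference, a minimal sketch of the two options side by side (the layer sizes and the random input are just illustrative, not taken from any example):

```python
import torch
from brevitas.nn import QuantIdentity, QuantLinear
from brevitas.quant import Int8ActPerTensorFloat

x = torch.randn(1, 2)

# Option A: a standalone QuantIdentity in front of the layer
inp_quant = QuantIdentity(act_quant=Int8ActPerTensorFloat, return_quant_tensor=True)
fc_a = QuantLinear(2, 4, bias=False)
out_a = fc_a(inp_quant(x))

# Option B: the input quantizer attached directly to the layer
fc_b = QuantLinear(2, 4, input_quant=Int8ActPerTensorFloat, bias=False)
out_b = fc_b(x)
```

Algorithmically both paths quantize the input the same way; the difference only shows up in how the model is exported.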

@phixerino (Author)

Thank you, I understand. But why isn't input_quant=Int8ActPerTensorFloat used in the first layer of ResNet or any of the ImageNet examples?

@fabianandresgrob (Contributor)

Currently, when quantizing models using src/brevitas_examples/imagenet_classification/ptq/ptq_evaluate.py, these settings are applied when the method quantize_model() is called. Basically, the original layers are replaced by their quant counterparts and input_quant is set according to the configuration you pass. You can check out this method to see how it is done; stepping through it with the debugger helps. A hand-written sketch of the idea follows below.
We are working to expose these methods and provide an easier workflow.
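
To make the idea concrete, here is a hand-written sketch of what the replacement boils down to for a single layer (the Conv2d shape and the Int8WeightPerTensorFloat weight quantizer are just assumptions here; quantize_model() picks the quantizers from the configuration you pass and handles many more details):

```python
import torch.nn as nn
from brevitas.nn import QuantConv2d
from brevitas.quant import Int8ActPerTensorFloat, Int8WeightPerTensorFloat

# Original float layer from the pretrained model
float_conv = nn.Conv2d(3, 16, kernel_size=3, bias=False)

# Quant counterpart with input_quant (and weight_quant) set explicitly
quant_conv = QuantConv2d(
    3, 16, kernel_size=3, bias=False,
    input_quant=Int8ActPerTensorFloat,
    weight_quant=Int8WeightPerTensorFloat)

# Reuse the pretrained float weights in the quantized layer
quant_conv.weight.data.copy_(float_conv.weight.data)
```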

@phixerino (Author)

I see. So when I want to use QAT, I need to put input_quant=Int8ActPerTensorFloat? And if I didn't, then the input to my first layer would not be quantized, right? Then how does the quantization work when the weights of the layer are quantized? I'm trying to figure out how that would impact the speed of model inference on an FPGA.

@fabianandresgrob (Contributor)

Usually, you want your input to be quantized, so you need to specify input_quant or use QuantIdentity. In this example, the input is expected to already be quantized to 8 bits. If you don't quantize your input, you'll basically multiply unrestricted floating-point inputs with your quantized weights, leading to a higher bit-width output. That means intermediate outputs need to be stored at a higher bit width. Usually the workflow is to quantize your input and weights, do the forward pass, and then quantize the output again, so it becomes the quantized input for the next layer. Hope that helps.
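
As a minimal sketch of that quantize-input -> conv -> re-quantize pattern (shapes and bit widths here are just illustrative):

```python
import torch
import brevitas.nn as qnn

model = torch.nn.Sequential(
    qnn.QuantIdentity(return_quant_tensor=True),     # quantize the input to 8 bits
    qnn.QuantConv2d(3, 16, 3, bias=False,
                    weight_bit_width=8,
                    return_quant_tensor=True),        # quantized weights
    qnn.QuantReLU(return_quant_tensor=True))          # re-quantize the activation
out = model(torch.randn(1, 3, 32, 32))                # quantized input for the next layer
```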

@phixerino (Author)

Thank you, it does help a lot. I'm guessing that the input to the next QuantConv2d layer is quantized by the QuantReLU, because it has return_quant_tensor=True. Is there any advantage to having return_quant_tensor=True also in the QuantConv2d layer?

Also, if I set input_quant=Int8ActPerTensorFloat, does it matter whether the input is in the range 0-1 or 0-255?

@phixerino (Author)

With input_quant, export to FINN doesn't work:

File /opt/conda/lib/python3.10/site-packages/brevitas/export/onnx/manager.py:121, in ONNXBaseManager.export_onnx(cls, module, args, export_path, input_shape, input_t, disable_warnings, **onnx_export_kwargs)
...
---> 30     assert not module.is_input_quant_enabled
     31     assert not module.is_output_quant_enabled
     32     if module.is_bias_quant_enabled:

AssertionError: 

So instead I used qnn.QuantIdentity(bit_width=first_layer_weight_bit_width, return_quant_tensor=True) and it works.

I'm using 4 bits to quantize my weights and activations, but, as in the examples, I'm using 8 bits for the first and last layer. Now that I'm using 8 bits in the QuantIdentity layer, should I still use 8 bits in my first QuantConv2d layer?

Sorry for loads of questions, but I really appreciate the answers.

@fabianandresgrob (Contributor)

> Thank you, it does help a lot. I'm guessing that the input to the next QuantConv2d layer is quantized by the QuantReLU, because it has return_quant_tensor=True. Is there any advantage to having return_quant_tensor=True also in the QuantConv2d layer?
>
> Also, if I set input_quant=Int8ActPerTensorFloat, does it matter whether the input is in the range 0-1 or 0-255?

In fact, it does make a difference. Compare this tutorial, especially cell 13 onwards. The results of a QuantConv layer with return_quant_tensor disabled followed by a QuantReLU will be slightly different from a QuantConv with return_quant_tensor enabled followed by a QuantReLU. This is because one usually quantizes the output of the conv layer with an 8-bit signed quantizer. However, ReLU can exploit unsigned quantization, as the output will be > 0 anyway. So if you quantize the output of the conv layer with a signed quantizer and then apply the QuantReLU, you lose half of the range, as all negative values are stripped. The output will then be re-quantized using an unsigned quantizer. If we use 8 bits for both, we basically lose 1 bit: the conv output would be in the range -127 to 127, and after ReLU it would be 0 to 127, whereas an unsigned 8-bit quantizer could have used the range 0 to 255.

Similarly, applying quantization to the range 0-1 vs. 0-255 makes a difference. I'd suggest going with the usual 0-1 range for the input data.
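
Roughly, the two orderings being compared look like this (not the tutorial's exact cells; the shapes are made up, and the explicit output_quant on the second conv stands in for the signed 8-bit output quantizer mentioned above):

```python
import torch
import brevitas.nn as qnn
from brevitas.quant import Int8ActPerTensorFloat

x = torch.randn(1, 3, 8, 8)
inp_q = qnn.QuantIdentity(return_quant_tensor=True)

# Variant A: only the QuantReLU quantizes the conv output, using its
# default unsigned 8-bit quantizer (full 0..255 range available)
conv_a = qnn.QuantConv2d(3, 4, 3, bias=False)
relu_a = qnn.QuantReLU(return_quant_tensor=True)
out_a = relu_a(conv_a(inp_q(x)))

# Variant B: the conv output is first quantized signed (-127..127), then
# the ReLU strips the negatives and re-quantizes, so one bit of range is lost
conv_b = qnn.QuantConv2d(3, 4, 3, bias=False,
                         output_quant=Int8ActPerTensorFloat,
                         return_quant_tensor=True)
relu_b = qnn.QuantReLU(return_quant_tensor=True)
out_b = relu_b(conv_b(inp_q(x)))
```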

> With input_quant, export to FINN doesn't work:
>
> File /opt/conda/lib/python3.10/site-packages/brevitas/export/onnx/manager.py:121, in ONNXBaseManager.export_onnx(cls, module, args, export_path, input_shape, input_t, disable_warnings, **onnx_export_kwargs)
> ...
> ---> 30     assert not module.is_input_quant_enabled
>      31     assert not module.is_output_quant_enabled
>      32     if module.is_bias_quant_enabled:
>
> AssertionError:
>
> So instead I used qnn.QuantIdentity(bit_width=first_layer_weight_bit_width, return_quant_tensor=True) and it works.
>
> I'm using 4 bits to quantize my weights and activations, but, as in the examples, I'm using 8 bits for the first and last layer. Now that I'm using 8 bits in the QuantIdentity layer, should I still use 8 bits in my first QuantConv2d layer?
>
> Sorry for loads of questions, but I really appreciate the answers.

Yes, the QuantIdentity quantizes the input for the first layer; however, it does not quantize the weights of the first QuantConv2d to 8 bits. So if you want to quantize your first conv layer's weights to 8 bits, you need to set the weight bit width to 8 for that layer as well.
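
So a first block along the lines you describe would look roughly like this (the conv shape is just an example):

```python
import torch
import brevitas.nn as qnn

first_layer_weight_bit_width = 8  # as in your QuantIdentity above

stem = torch.nn.Sequential(
    qnn.QuantIdentity(bit_width=first_layer_weight_bit_width,
                      return_quant_tensor=True),   # 8-bit input quantization
    qnn.QuantConv2d(3, 64, 3, bias=False,
                    weight_bit_width=8,            # 8-bit weights must be set explicitly
                    return_quant_tensor=True))
out = stem(torch.randn(1, 3, 32, 32))
```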

No worries :)
