
Doubt about the __add__ implementation of the IntQuantTensor #1106

Closed
balditommaso opened this issue Dec 3, 2024 · 7 comments

@balditommaso

Hi, I was looking at the new implementation of __add__ in the IntQuantTensor class, and I am not sure you are handling the case of adding a QuantTensor to a plain Tensor correctly. Indeed, you are adjusting the zero_point in the following way:

zero_point=self.zero_point - _unpack_quant_tensor(other) / self.scale,

By doing so you are effectively creating a zero_point value for every element of the tensor, and this is not good for the following reasons:

  1. From a memory point of view, we are doubling the size of the tensor, because the .zero_point and .value tensors end up with the same shape.
  2. From a theoretical point of view, we have layer-wise quantization, where scale and zero_point are scalars, and channel-wise quantization, where scale and zero_point are tensors whose length equals the number of channels. I am not sure weight-wise (per-element) quantization exists, or whether it can easily be handled in HW.

I think it might be better to handle that situation by dequantizing the QuantTensor and returning the sum of two plain Tensors.
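For reference, here is a small sketch (my own illustration, not Brevitas code) of why that line produces a per-element zero_point. With the affine convention value = (int_repr - zero_point) * scale, which the quoted line implies, keeping the integer representation and the scale fixed while adding a dense tensor t forces the zero_point to absorb t element-wise:

import torch

# Illustration only: per-tensor scale and zero_point before the addition.
s = torch.tensor(0.05)
z = torch.tensor(0.0)
int_repr = torch.randint(-128, 128, (100, 10)).float()
t = torch.ones((100, 10))  # plain tensor being added

# Same adjustment as the quoted line: zero_point - other / scale.
z_new = z - t / s
value_new = (int_repr - z_new) * s
assert torch.allclose(value_new, (int_repr - z) * s + t)
print(z_new.shape)  # torch.Size([100, 10]) -> one zero_point per element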

Let me know what you think!

[Screenshot attached: 2024-12-03 11:33]

@Giuseppe5 (Collaborator)

My understanding from your message is that what we currently do in Brevitas is not technically wrong, but maybe it is not the best way to handle this case, correct?

In Brevitas, we tend to preserve and propagate the concept of QuantTensor as much as possible, as in the case you pointed out, since it's always possible to fall back to a normal tensor by calling output_quant_tensor.value. The opposite, however, is not possible: once you fall back to a Tensor, you can't easily reconstruct the QuantTensor it came from.

In the future, we might think about introducing a flag that lets the user select one behavior or the other, but for now I would recommend manually dequantizing your tensor after the addition.
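For example, something along these lines (a minimal sketch, assuming x is an IntQuantTensor and other is a plain torch.Tensor of matching shape):

# Keep the QuantTensor through the addition, then drop to a plain Tensor.
out = (x + other).value

# Or dequantize before the addition and work with plain Tensors from there on.
out = x.value + other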

Going back to the question at the beginning of my answer: if I misunderstood and the result is technically wrong, let us know and we'll promptly fix it.

Thanks!

@balditommaso (Author)

I see your point, but maybe there should be a warning for the user, so they can be aware of this behavior, which is not "canonical" in quantization theory. I discovered it by chance, because I was surprised by the size of the model. Indeed, imagine moving this implementation to HW: if for each weight (e.g. INT8) we need its own zero_point (e.g. FP32) value, we are wasting a lot of memory.

I am not sure there is a correct way to handle this case, it's tricky; in any case, thank you for your answer, now I know what you were aiming for with this approach!

@Giuseppe5 (Collaborator)

Can you provide a small script with an example of this and its impact on the size of the model?

I imagine it has an impact on the runtime memory you might need to run the model (which goes away once you dequantize), but I'm not sure I understand how it would impact the "static" size of your model.

Maybe I am not understanding correctly, so a small example could clarify that.

@balditommaso (Author)

With this script you will see that the number of zero_point entries is the same as the number of values, so from a memory point of view you are going from O(n + 2) to O(2n + 1), where n is the number of quantized elements. That can be a lot, depending on the use case.

import torch
import brevitas.nn as qnn
from brevitas.quant.scaled_int import Int8ActPerTensorFloat

# Per-tensor activation quantization: scale and zero_point start out as scalars.
quant = qnn.QuantIdentity(Int8ActPerTensorFloat, return_quant_tensor=True)

x = torch.randn((100, 10))
x = quant(x)
print(x.value.numel())       # 1000
print(x.zero_point.numel())  # 1 (per-tensor zero_point)
x = x + torch.ones_like(x)   # add a plain Tensor to the QuantTensor
print(x.value.numel())       # 1000
print(x.zero_point.numel())  # 1000: one zero_point per element

@Giuseppe5 (Collaborator)

Right, but the size of the state_dict() is not affected by how we treat the addition.

And to avoid the issue, as mentioned above, you could simply change

x = x + torch.ones_like(x)

to

x = x.value + torch.ones_like(x)

I understand your point and, as mentioned, we are thinking about a clean solution that allows both behaviors to co-exist.

I just wanted to make sure that there were no side effects that would cause an increase in the size of the model's state_dict.

@balditommaso (Author)

The state_dict is not impacted, but the QONNX model I am going to use to deploy on FPGA will contain a lot of zero_points. However, I see your point: if you are quantizing the model to deploy it on edge devices, you shouldn't do that kind of addition in the first place.

@Giuseppe5 (Collaborator)

I think we are in agreement, so I am going to close this issue for now, but feel free to re-open it if you have further issues with this.

Thanks for pointing this out and for the example, I am sure it will be useful for other Brevitas users as well!
