
Doubt about the __add__ implementation of the IntQuantTensor #1106

Closed
balditommaso opened this issue Dec 3, 2024 · 7 comments

@balditommaso

Hi, I was looking at the new implementation of __add__ in the IntQuantTensor class, and I am not sure you are handling the case of adding a QuantTensor to a plain Tensor correctly. Indeed, you are adjusting the zero_point in the following way:

zero_point=self.zero_point - _unpack_quant_tensor(other) / self.scale,

By doing so you are effectively creating a zero_point value for every element of the tensor, and this is not good for the following reasons:

  1. From a memory point of view, we are doubling the size of the tensor, because the .zero_point and .value tensors end up with the same shape.
  2. From a theoretical point of view, we have layer-wise quantization, where scale and zero_point are scalars, and channel-wise quantization, where scale and zero_point are tensors whose length equals the number of channels. I am not sure weight-wise (per-element) quantization exists, or whether it can easily be handled in HW.

I think it might be better to handle that situation by dequantizing the QuantTensor and returning the sum of two plain Tensors.
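For reference, here is a small sketch (my own illustration, not Brevitas code) of why that line produces a per-element zero_point. With the affine convention value = (int_repr - zero_point) * scale, which the quoted line implies, keeping the integer representation and the scale fixed while adding a dense tensor t forces the zero_point to absorb t element-wise:

import torch

# Illustration only: per-tensor scale and zero_point before the addition.
s = torch.tensor(0.05)
z = torch.tensor(0.0)
int_repr = torch.randint(-128, 128, (100, 10)).float()
t = torch.ones((100, 10))  # plain tensor being added

# Same adjustment as the quoted line: zero_point - other / scale.
z_new = z - t / s
value_new = (int_repr - z_new) * s
assert torch.allclose(value_new, (int_repr - z) * s + t)
print(z_new.shape)  # torch.Size([100, 10]) -> one zero_point per element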

Let me know what you think!

[Screenshot attached: 2024-12-03 11:33]

@Giuseppe5 (Collaborator)

My understanding from your message is that what we currently do in Brevitas is not technically wrong, but maybe it is not the best way to handle this case, correct?

In Brevitas, we tend to preserve and propagate the concept of QuantTensor as much as possible, as in the case you pointed out, since it's always possible to fall back to a normal tensor by calling output_quant_tensor.value. The opposite, however, is not possible: once you fall back to a Tensor, you can't easily reconstruct the QuantTensor it came from.

In the future, we might think about introducing a flag that lets the user select one behavior or the other, but for now I would recommend manually dequantizing your tensor after the addition.
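For example, something along these lines (a minimal sketch, assuming x is an IntQuantTensor and other is a plain torch.Tensor of matching shape):

# Keep the QuantTensor through the addition, then drop to a plain Tensor.
out = (x + other).value

# Or dequantize before the addition and work with plain Tensors from there on.
out = x.value + other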

Going back to the question at the beginning of my answer: if I misunderstood and the result is technically wrong, let us know and we'll promptly fix it.

Thanks!

@balditommaso (Author)

I see your point, but maybe there should be a warning for the user, so they can be aware of this behavior, which is not "canonical" in quantization theory. I discovered it by chance, because I was surprised by the size of the model. Indeed, imagine moving this implementation to HW: if for each weight (e.g. INT8) we need its own zero_point (e.g. FP32) value, we are wasting a lot of memory.

I am not sure there is a correct way to handle this case, it's tricky; in any case, thank you for your answer, now I know what you were aiming for with this approach!

@Giuseppe5 (Collaborator)

Can you provide a small script with an example of this and its impact on the size of the model?

I imagine it has an impact on the runtime memory you might need to run the model (which goes away once you dequantize), but I'm not sure I understand how it would impact the "static" size of your model.

Maybe I am not understanding correctly, so a small example could clarify that.

@balditommaso (Author)

With this script you will see that the number of zero_point entries is the same as the number of values, so from a memory point of view you are going from O(n + 2) to O(2n + 1), where n is the number of quantized elements. That can be a lot, depending on the use case.

import torch
import brevitas.nn as qnn
from brevitas.quant.scaled_int import Int8ActPerTensorFloat

# Per-tensor activation quantization: scale and zero_point start out as scalars.
quant = qnn.QuantIdentity(Int8ActPerTensorFloat, return_quant_tensor=True)

x = torch.randn((100, 10))
x = quant(x)
print(x.value.numel())       # 1000
print(x.zero_point.numel())  # 1 (per-tensor zero_point)
x = x + torch.ones_like(x)   # add a plain Tensor to the QuantTensor
print(x.value.numel())       # 1000
print(x.zero_point.numel())  # 1000: one zero_point per element

@Giuseppe5 (Collaborator)

Right, but the size of the state_dict() is not affected by how we treat the addition.

And to avoid the issue, as mentioned above, you could simply change

x = x + torch.ones_like(x)

to

x = x.value + torch.ones_like(x)

I understand your point and, as mentioned, we are thinking about a clean solution that allows both behaviors to co-exist.

I just wanted to make sure that there were no side effects that would cause an increase in the size of the model's state_dict.

@balditommaso (Author)

The state_dict is not impacted, but the QONNX model I am going to use to deploy on FPGA will contain a lot of zero_points. However, I see your point: if you are quantizing the model to deploy it on edge devices, you shouldn't do that kind of addition in the first place.

@Giuseppe5 (Collaborator)

I think we are in agreement, so I am going to close this issue for now, but feel free to re-open it if you have further issues with this.

Thanks for pointing this out and for the example, I am sure it will be useful for other Brevitas users as well!
