[PT2E] observers do not handle inputs with different shapes correctly #2112

Open
Xia-Weiwen opened this issue Apr 23, 2025 · 20 comments

Labels: pt2e_quant pt2 export quantization triaged

@Xia-Weiwen
Collaborator

Summary
The observers in Torchao, such as AffineQuantizedMinMaxObserver, support different quantization granularities by recording block sizes. For example, if the granularity is PerTensor and the input shape is (16, 3, 224, 224), the observer records block sizes = (16, 3, 224, 224).
However, the block sizes become wrong if inputs with different shapes are passed in. For example, if another input with shape = (16, 3, 56, 56) comes, the block sizes are updated to (16, 3, 56, 56), which is wrong for inputs with shape = (16, 3, 224, 224).
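
For illustration, a minimal toy sketch of the behavior described above (this is not the actual torchao observer code; the real logic lives behind AffineQuantizedMinMaxObserver):

    import torch

    # Toy stand-in for a per-tensor observer that records block_size
    # from whatever input it last saw.
    class ToyPerTensorObserver:
        def __init__(self):
            self.block_size = None

        def observe(self, x):
            # block_size is overwritten with the full shape of the current input
            self.block_size = tuple(x.shape)

    obs = ToyPerTensorObserver()
    obs.observe(torch.randn(16, 3, 224, 224))
    print(obs.block_size)  # (16, 3, 224, 224)
    obs.observe(torch.randn(16, 3, 56, 56))
    print(obs.block_size)  # (16, 3, 56, 56) -- stale for the first shape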

How to reproduce
The code to reproduce is similar to the reproducer here: #2094
One needs to bypass the #2094 issue by changing the source code manually, and also run the converted_model after convert_pt2e:

    converted_model = convert_pt2e(prepared_model)
    move_exported_model_to_eval(converted_model)
    print("[info] converted_model =\n", converted_model)
    converted_model(*example_inputs)  # add this line
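
A hedged, more complete sketch of these extra steps (prepared_model is assumed to come from the #2094 reproducer, and the import paths assume the PyTorch PT2E entry points; the torchao PT2E flow may expose them under a different module):

    import torch
    from torch.ao.quantization.quantize_pt2e import convert_pt2e
    from torch.ao.quantization import move_exported_model_to_eval

    # `prepared_model` is assumed to come from the #2094 reproducer
    # (export + prepare_pt2e + calibration), so it already carries the
    # observers discussed in the summary above.
    converted_model = convert_pt2e(prepared_model)
    move_exported_model_to_eval(converted_model)
    print("[info] converted_model =\n", converted_model)

    # Running the converted model with two different input shapes makes the
    # stale block sizes recorded by the observer visible.
    converted_model(torch.randn(16, 3, 224, 224))
    converted_model(torch.randn(16, 3, 56, 56))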
@Xia-Weiwen Xia-Weiwen changed the title [PT2E] observers does not handle inputs with different shapes correctly [PT2E] observers do not handle inputs with different shapes correctly Apr 23, 2025
@jerryzh168
Contributor

Yeah, we haven't tested per-tensor that much, I think; we'd need to fix the per-tensor use case.

@jerryzh168
Contributor

I have thought about this scenario a bit before. I think the fix for this one would be to use granularity instead of block_size in the observer. Please feel free to take this if you have time @Xia-Weiwen

@Xia-Weiwen
Collaborator Author

Do you mean using granularity in all cases or only for per-tensor?

@Xia-Weiwen
Collaborator Author

I think using block size = -1 for per-tensor also makes sense. What do you think?

@jerryzh168
Contributor

Do you mean using granularity in all cases or only for per-tensor?

Just for per-tensor for now. We can assert that only one of granularity and block_size is valid, and use the one that's available.

@jerryzh168
Contributor

I think using block size = -1 for per-tensor also makes sense. What do you think?

Yeah, this could make sense, but then the question is: can we express all dynamic shapes with block_size?

@Xia-Weiwen
Collaborator Author

I think so. For a certain dimension:

  • If group size > 1 is specified, we just use that group size as the block size
  • If group size == 1, such as per-channel and per-token, we use block size = 1
  • If it's not group-wise quantization, such as per-tensor and the other dimensions of per-channel, we use block size = -1

All cases can be covered. How does that sound?
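
As an illustration only, a minimal sketch of this proposed mapping with a hypothetical helper (the string-encoded granularities here are just for the example; torchao's real granularity objects are classes such as PerTensor, PerAxis, and PerGroup):

    # Hypothetical helper illustrating the proposal: -1 marks a dimension
    # whose size should follow the input tensor at runtime.
    def block_size_for(shape, granularity):
        if granularity == "per_tensor":
            # every dim follows the input
            return tuple(-1 for _ in shape)
        if granularity.startswith("per_channel:"):
            axis = int(granularity.split(":")[1])
            # size 1 on the channel axis, -1 (dynamic) everywhere else
            return tuple(1 if dim == axis else -1 for dim in range(len(shape)))
        if granularity.startswith("per_group:"):
            group = int(granularity.split(":")[1])
            # group size on the last dim, 1 on the preceding dims
            return tuple([1] * (len(shape) - 1) + [group])
        raise ValueError(f"unknown granularity: {granularity}")

    print(block_size_for((16, 3, 224, 224), "per_tensor"))  # (-1, -1, -1, -1)
    print(block_size_for((128, 256), "per_channel:0"))      # (1, -1)
    print(block_size_for((128, 256), "per_group:64"))       # (1, 64)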

@jerryzh168
Contributor

I think so. For a certain dimension:

  • If group size > 1 is specified, we just use that group size as the block size
  • If group size == 1, such as per-channel and per-token, we use block size = 1
  • If it's not group-wise quantization, such as per-tensor and the other dimensions of per-channel, we use block size = -1

All cases can be covered. How does that sound?

I remember discussing this with @drisspg and @vkuzo before. There is another dimension, which is the number of dims (the rank of the tensor). E.g. for per-token/per-row, all the preceding dims should use size 1, except for the last dim:

elif isinstance(granularity, PerRow):

E.g. for a 2-dim tensor, block_size will be (1, shape[-1]); for a 3-dim tensor, block_size will be (1, 1, shape[-1]).

I'm not sure if we will have an input tensor with variable rank, though.

So in theory this is not enough, but maybe it's enough in practice?
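
For illustration, a toy version of that rank-dependent per-row block_size computation (not the actual torchao code):

    def per_row_block_size(shape):
        # all preceding dims use size 1; the last dim covers the full row
        return tuple([1] * (len(shape) - 1) + [shape[-1]])

    print(per_row_block_size((8, 16)))     # (1, 16)
    print(per_row_block_size((4, 8, 16)))  # (1, 1, 16)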

@Xia-Weiwen
Collaborator Author

I see. Yeah, maybe an observer would not see multiple ranks in practice, but I am not sure. Would you have more discussion and decide what to do? Thanks.

@jerryzh168
Contributor

jerryzh168 commented Apr 24, 2025

Yeah, we'll discuss in our team meeting tomorrow and decide what to do.

@jerryzh168
Contributor

@Xia-Weiwen we discussed this in the meeting, and we feel that adding -1 to represent a dynamic size makes sense. Since we don't anticipate seeing cases that have a dynamic number of dims (rank) as well, we can ignore that for now. Do you want to start adding support for -1 for per-tensor quantization first?

@jerryzh168 jerryzh168 added the pt2e_quant pt2 export quantization label Apr 29, 2025
@Xia-Weiwen
Collaborator Author

@Xia-Weiwen we discussed this in the meeting, and we feel that adding -1 to represent a dynamic size makes sense. Since we don't anticipate seeing cases that have a dynamic number of dims (rank) as well, we can ignore that for now.

Great

Do you want to start adding support for -1 for per-tensor quantization first?

Sure, I will work on that.

BTW, it sounds like it can also fix issue #2094

@Xia-Weiwen Xia-Weiwen self-assigned this Apr 29, 2025
@jerryzh168 jerryzh168 removed their assignment Apr 29, 2025
@Xia-Weiwen
Collaborator Author

Hi @jerryzh168, we do see dynamic dimension size when running Resnet18, so block_size = -1 does not work. Do you want to use granularity instead of block sizes for per-tensor quantization? Thanks.

@jerryzh168
Contributor

Hi @jerryzh168, we do see dynamic dimension size when running Resnet18, so block_size = -1 does not work. Do you want to use granularity instead of block sizes for per-tensor quantization? Thanks.

I see. If it's just per-tensor, I think we can use -1 to represent it.

@drisspg
Contributor

drisspg commented May 8, 2025

The discussion was that dynamic rank is less common, and in that case -1 will likely not work. But most common quantization schemes w/ varying sizes in fixed rank tensors can be expressed w/ -1 + integer

@Xia-Weiwen
Collaborator Author

The discussion was that dynamic rank is less common, and in that case -1 will likely not work. But most common quantization schemes w/ varying sizes in fixed rank tensors can be expressed w/ -1 + integer

Thanks for your comment. Dynamic rank was found when running Resnet18, which is a classic CNN model, and the same issue may exist with all Resnet-like models. May I know what the conclusion is to cover all cases? And what does "-1 + integer" mean? Thanks.

@jerryzh168
Contributor

jerryzh168 commented May 8, 2025

@Xia-Weiwen for the case you found in resnet18, is the dynamic-rank tensor only quantized with per-tensor quantization? If so, I think we can use a single -1 number to represent both dynamic rank and dynamic dim. In other cases, we will only allow fixed rank and dynamic dim. So for example:

Per Tensor: -1
Per Group: (1, 128)
Per Axis (axis = 2): (m, n, 1)
Per Axis (axis = 2 and dim 0 is dynamic): (-1, n, 1)

Would this work?
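
A hedged sketch of how such -1 entries could be resolved against a concrete input shape (hypothetical helper, not the actual torchao quant primitives):

    def resolve_block_size(block_size, shape):
        # A single -1 means per-tensor: the block covers the whole input,
        # whatever its rank.
        if block_size == -1 or block_size == (-1,):
            return tuple(shape)
        # Otherwise the rank is fixed; -1 entries follow the input dim at runtime.
        assert len(block_size) == len(shape)
        return tuple(s if b == -1 else b for b, s in zip(block_size, shape))

    print(resolve_block_size(-1, (16, 3, 224, 224)))       # (16, 3, 224, 224)
    print(resolve_block_size((-1, 256, 1), (8, 256, 64)))  # (8, 256, 1)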

@Xia-Weiwen
Collaborator Author

@jerryzh168 Thanks for the suggestion. As I asked here #2177 (comment), it requires bypassing checks of dimensions for per-tensor quantization. Is that OK with you?

@jerryzh168
Contributor

Yeah, I think that is fine. We need to adapt our quant primitives to work with the -1 semantics overall.

@Xia-Weiwen
Collaborator Author

OK. I have updated the PR here: #2177. Now we use [-1] instead of the integer -1 for per-tensor quant due to some grammar checks. Is that OK? Please review the PR. Thanks.
