Conversation

@AdrianLundell (Collaborator) commented Nov 3, 2025

Reuses the FoldAndAnnotateQParamsPass from the Arm backend to greatly simplify the logic for fusing the ops.

Additionally updates the linear kernel to be numerically correct and computes the kernel_sum ahead of time (AOT) in the quantized_linear_fusion pass. Note that since this replaces the bias node, it typically causes no extra memory usage.
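
For reference, the fold works because the activation zero-point contribution to the int32 accumulator depends only on the weights. A minimal numpy sketch of the idea (function names are illustrative, not the actual pass code), assuming per-tensor int8 quantization with activation zero point x_zp, weight zero point 0, and an int32 bias:

```python
import numpy as np

def fold_kernel_sum_into_bias(weight_q, bias_q, x_zp):
    # kernel_sum[n] = sum_k weight_q[n, k]; the -x_zp * kernel_sum[n] term is
    # constant for fixed weights, so it can be absorbed into the bias AOT.
    kernel_sum = weight_q.astype(np.int32).sum(axis=1)
    return bias_q.astype(np.int32) - x_zp * kernel_sum

def quantized_linear_acc(x_q, weight_q, folded_bias):
    # int32 accumulator of the fused op: no zero-point handling left at runtime.
    return x_q.astype(np.int32) @ weight_q.astype(np.int32).T + folded_bias

# Check against a reference that subtracts the activation zero point explicitly.
rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
w_q = rng.integers(-128, 128, size=(16, 8), dtype=np.int8)
b_q = rng.integers(-1000, 1000, size=(16,), dtype=np.int32)
x_zp = 3
reference = (x_q.astype(np.int32) - x_zp) @ w_q.astype(np.int32).T + b_q
folded = fold_kernel_sum_into_bias(w_q, b_q, x_zp)
assert np.array_equal(reference, quantized_linear_acc(x_q, w_q, folded))
```

Since the folded bias has the same shape and dtype as the original int32 bias, replacing the bias node leaves the constant-data footprint unchanged.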

Updates the linear tests to mirror this, including removing the various matmul tests; since linear is now handled as a separate op rather than as a particular kind of matmul, those tests are no longer relevant.

Removes unnecessary stub definitions in operators.py, operators.yaml, and op_quantized_linear.cpp.

Leaving a few TODOs since the patch is already large.

cc @freddan80 @per @zingo @oscarandersson8218 @digantdesai

Signed-off-by: Adrian Lundell <[email protected]>
Change-Id: I194228ee3ae4b64a92f3f818afb2e045cc3acf91
@AdrianLundell requested a review from psiddh November 3, 2025 15:59
@AdrianLundell added the "partner: arm" (for backend delegation, kernels, demos, etc. from the 3rd-party partner, Arm) and "release notes: none" (do not include this in the release notes) labels Nov 3, 2025
@pytorch-bot bot commented Nov 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15526

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 9b8cdf2 with merge base 6ab8723:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Nov 3, 2025
@psiddh (Contributor) commented Nov 4, 2025

@AdrianLundell Just want to better understand the rationale behind removing the per-channel support / scratch buffer code and, in general, the stateful-ops-related code? (I do understand the FoldDQQ-related changes to the pass, though.)

@AdrianLundell (Collaborator, Author) commented

The implementation didn't produce numerically correct results originally, so I wanted to fix that before adding per-channel support. Not doing that in this patch was mostly to keep the patch size down and to prioritize the most important functionality before going into details.

Regarding the stateful scratch buffer: since we compute the kernel sum that the buffer was used for AOT and replace the bias, there is no reason to do that in the runtime as I see it. The only use case I see where that would make sense (using the MVE implementation) is if you are very tight on memory and prefer to spend some extra time during each inference to compute the kernel sum in a scratch buffer rather than keeping it in memory. But then again, there is no reason to do everything in one single patch; better to get a good minimal working example running, IMO.
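
For contrast, a rough sketch of that runtime alternative (illustrative only, not the CMSIS-NN API): recomputing the kernel sum into a scratch buffer on every call instead of storing the folded bias as a constant:

```python
import numpy as np

def quantized_linear_with_scratch(x_q, weight_q, bias_q, x_zp, scratch):
    # Recompute the kernel sum into the caller-provided int32 scratch buffer on each
    # call, trading extra cycles per inference for not storing a folded bias constant.
    np.sum(weight_q.astype(np.int32), axis=1, out=scratch)  # kernel_sum[n] = sum_k w[n, k]
    return x_q.astype(np.int32) @ weight_q.astype(np.int32).T + bias_q.astype(np.int32) - x_zp * scratch

# The scratch buffer is one int32 per output channel, i.e. the same size as the
# folded bias, but it is transient rather than stored with the model constants.
```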

@psiddh (Contributor) commented Nov 4, 2025

  • Yes, we need a scratch buffer for the DSP & MVE (Helium) implementations, I think.
  • Given that, I would suggest keeping the CMSISScratchBufferContext file as-is for now. The idea was to evolve this into a generic ScratchBuffer implementation that could be reused for a couple of different optimization scenarios.
  • How are you testing these changes? E2E on FVP or on some real device for numerical results? Can you share the steps, please?
