Fix alignment calculation in XNNWeightsCache #15039

GregoryComer · 2025-10-11T22:20:32Z

Summary

We're seeing crashes on Android when running XNNPACK-delegated models. I tracked it down to a bug in the alignment calculation for weight cache memory. To make the calculation, it casts the void* to a (signed) intptr_t. When the address is in the upper half of the address space, it becomes negative. This causes the modulo to return a negative value and increment the address too much - leading to out of bounds access.

executorch/backends/xnnpack/runtime/XNNWeightsCache.cpp

Lines 166 to 168 in cc6cb83

    
    void* maybe_aligned_space = data_container.data();
    void* aligned_space = (void*)((intptr_t)maybe_aligned_space + 64 -
                                  (intptr_t)maybe_aligned_space % 64);

Walking through the numbers I captured in #14831:

- The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
- The target alignment is 64 bytes.
- Casting the address to intptr_t gives -5476376639047992688.
- Mod 64 is -48.
- The total offset applied is 64 - (-48) = 112.
- Since the allocation size is N + 64, increasing the start by 112 means the new region extends 48 bytes past the end of the allocation.
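The arithmetic in the walkthrough can be reproduced with fixed-width integers, so it behaves the same on any host. This is a minimal sketch: the names `kRawAddress`, `signed_offset`, and `unsigned_offset` are hypothetical and do not appear in the ExecuTorch source. It relies on two's-complement conversion from unsigned to signed, which is what mainstream compilers do and what C++20 guarantees.

```cpp
#include <cstdint>

// Address captured in #14831; the helper names are illustrative only.
constexpr std::uint64_t kRawAddress = 0xb40000763d4bfa90ULL;
constexpr std::int64_t kAlignment = 64;

// Offset computed the way the original expression did, through a signed cast.
// In C++ the sign of `a % b` follows `a`, so a negative address yields a
// negative remainder and the offset grows beyond the alignment.
inline std::int64_t signed_offset(std::uint64_t address) {
  const auto as_signed = static_cast<std::int64_t>(address);
  return kAlignment - (as_signed % kAlignment);
}

// Offset computed on the unsigned value: the remainder is always in [0, 63].
inline std::int64_t unsigned_offset(std::uint64_t address) {
  const std::uint64_t rem = address % static_cast<std::uint64_t>(kAlignment);
  return kAlignment - static_cast<std::int64_t>(rem);
}
```

For `kRawAddress`, `signed_offset` gives 112 while `unsigned_offset` gives 48; the difference of 64 bytes is why the aligned region overruns an allocation that only has 64 bytes of slack.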

To resolve this, I replaced the alignment code with a call to std::align. Casting to uintptr_t also resolves it, but using the standard implementation seems less error-prone.
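For readers unfamiliar with std::align, here is a rough sketch of how it can be applied to an over-allocated buffer. `align_within` is a hypothetical helper, not the code that landed in this PR. Note that std::align leaves an already-aligned pointer unchanged, whereas the old expression always advanced by at least one byte.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical helper: returns a 64-byte-aligned pointer inside `container`
// with room for `n` bytes, or nullptr if `n` bytes do not fit after alignment.
inline void* align_within(std::string& container, std::size_t n) {
  void* ptr = &container[0];
  std::size_t space = container.size();
  // std::align adjusts `ptr` and `space` in place and returns the aligned
  // pointer, or nullptr when the remaining space is smaller than `n`.
  return std::align(64, n, ptr, space);
}
```

Because std::align works on `void*` and `std::size_t`, no signed integer cast is involved, and the bounds check against the available space comes for free.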

Test plan

I've validated that the repro in #14831 does not crash with this change.

pytorch-bot · 2025-10-11T22:20:35Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15039

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 5 New Failures, 10 Pending

As of commit bb7a836 with merge base 019c8da:

NEW FAILURES - The following jobs have failed:

pull / test-qnn-wheel-packages-linux (3.10) / linux-job (gh)
RuntimeError: Command docker exec -t 90ddff479a1de8abc167467eb9ce9f2733d9ee439ab6bb4c2ce6c57675e84d40 /exec failed with exit code 1
pull / test-qnn-wheel-packages-linux (3.11) / linux-job (gh)
RuntimeError: Command docker exec -t d3a3070086255fa7d7e28c81b7bb9ac17e7b57aea32a170d37b44308f547a99f /exec failed with exit code 1
pull / test-qnn-wheel-packages-linux (3.12) / linux-job (gh)
RuntimeError: Command docker exec -t f6f6dbfcf8d8859db3bbaac080cd90cf3a029cf5b21d6824fb5c7d94289075db /exec failed with exit code 1
pull / test-samsung-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t b0d1a933d8658971b28d2c1bf5b6773a7e9d2f4bb10e763f101a20c9a2046019 /exec failed with exit code 1
Test CUDA Builds / export-voxtral-cuda-artifact / linux-job (gh)
RuntimeError: Command docker exec -t 1f13b16dd3e260e052513f06b939c10df737ad3d4b6a869e71a46918d596adb9 /exec failed with exit code 2

This comment was automatically generated by Dr. CI and updates every 15 minutes.

GregoryComer · 2025-10-11T23:49:12Z

CI failures are due to running on a fork or broken trunk.

mergennachin · 2025-10-12T18:27:11Z

This is great, thank you @GregoryComer

Is it possible to write a regression test?

mergennachin · 2025-10-13T11:40:16Z

backends/xnnpack/runtime/XNNWeightsCache.cpp

+    context->packed_pointer_to_container_[aligned_space] =
+        std::move(data_container);
+    return aligned_space;
+  } catch (std::exception& e) {


Is there any particular exception type you're expecting? It's usually not good practice to catch the base exception; I recommend specifying the exception type.

Why do we need to use try..catch in the first place?

This is here because Mergen requested asserting non-null output from std::align (it returns null on error if it can't align to the requested params). We also want non-fatal OOM handling. XNNPACK can gracefully handle null weight pointers (at least from a quick glance).

Since asserts are fatal, I needed to distinguish between null because memory allocation failed and null because std::align failed. String won't return null data, but since I'm adding the assert, I want it to be robust in case we swap to the ET allocator later. String can throw both bad_alloc and length_error, and both should be treated as an allocation failure. I'm not entirely sure why the code is using a string over a unique_ptr<uint8_t[]> or similar, but I wanted to keep the changes minimal. I'm open to suggestions.
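The distinction described above could look roughly like the following. This is a sketch under stated assumptions, not the merged code: `reserve_aligned` and `kCacheAlignment` are hypothetical names, and the caller is assumed to pass an `n` small enough that `n + 64` does not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <stdexcept>
#include <string>

constexpr std::size_t kCacheAlignment = 64;

// Hypothetical sketch: grows `container` to n + 64 bytes and returns an
// aligned pointer into it. nullptr is returned only for allocation failure.
inline void* reserve_aligned(std::size_t n, std::string& container) {
  try {
    container.resize(n + kCacheAlignment);
  } catch (const std::bad_alloc&) {
    return nullptr;  // out of memory: non-fatal, the caller can recover
  } catch (const std::length_error&) {
    return nullptr;  // request exceeds max_size(): treated as allocation failure
  }
  void* ptr = &container[0];
  std::size_t space = container.size();
  void* aligned = std::align(kCacheAlignment, n, ptr, space);
  // With 64 bytes of slack the alignment cannot fail, so a null result here
  // would indicate a logic error rather than an out-of-memory condition.
  assert(aligned != nullptr);
  return aligned;
}
```

Catching the two specific exception types, rather than std::exception, keeps the non-fatal path limited to the failures that are actually expected from the string allocation.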

digantdesai · 2025-10-13T17:24:26Z

it casts the void* to a (signed) intptr_t.

sigh

digantdesai

stamping to unblock, let's cp in 1.0

backends/xnnpack/runtime/XNNWeightsCache.cpp

GregoryComer · 2025-10-13T18:28:04Z

Is it possible to write a regression test?

This is a bit tricky as reproing the issue as is requires the allocation to be in the upper half of the address space. We do cover this code in CI in a number of places, but not on the right platforms. We could refactor it a bunch to allow a mock allocator in, but this is largely just testing the alignment logic. Since it's using std::align now, I don't necessarily see a ton of value in that.
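If the alignment arithmetic were ever written by hand again, one way to cover upper-half addresses without a real allocation there is to factor the math into a pure function on integers. The sketch below is illustrative only: `align_up` is a hypothetical helper, not part of ExecuTorch, and unlike the old expression it leaves an already-aligned address unchanged.

```cpp
#include <cstdint>

// Hypothetical pure helper: rounds `address` up to the next multiple of
// `alignment` (which must be a power of two), using unsigned arithmetic only.
constexpr std::uint64_t align_up(std::uint64_t address,
                                 std::uint64_t alignment) {
  return (address + alignment - 1) & ~(alignment - 1);
}

// The address from #14831 sits in the upper half of a 64-bit address space.
static_assert(align_up(0xb40000763d4bfa90ULL, 64) == 0xb40000763d4bfac0ULL,
              "upper-half address is rounded up by 48 bytes");
static_assert(align_up(0x1000, 64) == 0x1000,
              "already-aligned address is left unchanged");
```

Since the checks are `static_assert`s, they run at compile time on every platform, independent of where the allocator happens to place memory.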

I think the best takeaway would be to run more end to end model tests on Android (maybe via emulator?) in CI. That likely would have caught the bug, given that we see it pretty reliably on the Android demo app.

extension/android/CMakeLists.txt

GregoryComer · 2025-10-13T21:18:34Z

I'm re-testing locally and in CI after rebasing and minor cleanup. Hitting some seemingly unrelated errors with class loading of the Android LLM extension. Will land once I figure this out if everything looks good. Hopefully within the next hour or two.

kirklandsign · 2025-10-13T23:08:19Z

Hi @GregoryComer what is the android error? Is it consistent?

GregoryComer · 2025-10-13T23:13:19Z

Hi @GregoryComer what is the android error? Is it consistent?

Here's what I'm seeing. Any ideas?

Abort message: 'JNI DETECTED ERROR IN APPLICATION: JNI NewGlobalRef called with pending exception java.lang.NoSuchMethodError: no static or non-static method "Lorg/pytorch/executorch/extension/llm/LlmModule;.initHybrid(ILjava/lang/String;Ljava/lang/String;FLjava/lang/Object;)Lcom/facebook/jni/HybridData;"

This started after I rebased and switched branches, as well as closed and re-setup my terminal env, so I'm guessing it's on my end. CI seems to pass. I'm testing on the base commit now.

kirklandsign · 2025-10-13T23:19:42Z

Does #15067 fix the issue?

GregoryComer · 2025-10-14T02:13:53Z

I'm still seeing it after rebasing onto the latest main. Since CI is clean, I'm hopeful that it's a local issue. I'm going to merge, pick, and build RC3 and validate there.

GregoryComer · 2025-10-14T02:14:22Z

@pytorchbot cherrypick --onto release/1.0 -c critical

pytorch-bot · 2025-10-14T02:14:25Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'cherrypick' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

GregoryComer · 2025-10-14T02:15:37Z

@pytorchbot cherry-pick --onto release/1.0 -c critical

### Summary

We're seeing crashes on Android when running XNNPACK-delegated models. I tracked it down to a bug in the alignment calculation for weight cache memory. To make the calculation, it casts the void* to a (signed) intptr_t. When the address is in the upper half of the address space, it becomes negative. This causes the modulo to return a negative value and increment the address too much - leading to out of bounds access.

https://github.com/pytorch/executorch/blob/cc6cb837d6ac92f52a2d30a405900caf115f0556/backends/xnnpack/runtime/XNNWeightsCache.cpp#L166-L168

Walking through the numbers I captured in #14831:

* The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
* The target alignment is 64 bytes.
* Casting the address to intptr_t gives -5476376639047992688.
* Mod 64 is -48.
* The total offset applied is 64 - (-48) = 112.
* Since the allocation size is N + 64, increasing the start by 112 means the new region extends 48 bytes past the end of the allocation.

To resolve this, I replaced the alignment code with a call to std::align. Casting to uintptr_t also resolves it, but using the standard implementation seems less error-prone.

### Test plan

I've validated that the repro in #14831 does not crash with this change.

(cherry picked from commit 7421646)

pytorchbot · 2025-10-14T02:18:10Z

Cherry picking #15039

The cherry pick PR is at #15090 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

[v1.0.0] Release Tracker #14288 (comment)

Details for Dev Infra team

Raised by workflow job


kirklandsign · 2025-10-14T06:20:17Z

Hi @GregoryComer what device did you validate?

GregoryComer · 2025-10-14T18:53:14Z

Hi @GregoryComer what device did you validate?

I validated on an emulated Pixel 9 from an M1 host.

Edit: I tried the RC3 AAR on the same emulator instance and it works - no segfault.

meta-cla bot added the CLA Signed label Oct 11, 2025

GregoryComer requested review from abhinaykukkadapu and psiddh October 11, 2025 22:20

GregoryComer force-pushed the fix-weight-cache-align branch from d3e49c7 to 15097e3 Compare October 11, 2025 22:38

GregoryComer marked this pull request as ready for review October 11, 2025 23:49

GregoryComer requested a review from digantdesai as a code owner October 11, 2025 23:49

GregoryComer added the release notes: none label Oct 12, 2025

GregoryComer requested a review from kimishpatel October 12, 2025 07:08

mergennachin self-requested a review October 12, 2025 18:27

mergennachin reviewed Oct 12, 2025

backends/xnnpack/runtime/XNNWeightsCache.cpp (outdated)

GregoryComer force-pushed the fix-weight-cache-align branch 2 times, most recently from b8d7fb0 to 52e24e1 Compare October 13, 2025 05:47

mergennachin reviewed Oct 13, 2025

digantdesai approved these changes Oct 13, 2025

psiddh reviewed Oct 13, 2025

backends/xnnpack/runtime/XNNWeightsCache.cpp (outdated)

GregoryComer force-pushed the fix-weight-cache-align branch from 52e24e1 to 085298e Compare October 13, 2025 21:06

GregoryComer requested review from kirklandsign and larryliu0820 as code owners October 13, 2025 21:06

mergennachin reviewed Oct 13, 2025

extension/android/CMakeLists.txt

mergennachin approved these changes Oct 13, 2025

GregoryComer force-pushed the fix-weight-cache-align branch from 085298e to 8e91a58 Compare October 13, 2025 21:17

GregoryComer force-pushed the fix-weight-cache-align branch from 8e91a58 to ddcaa8f Compare October 13, 2025 22:49

Fix alignment calculation in XNNWeightsCache

bb7a836

GregoryComer force-pushed the fix-weight-cache-align branch from ddcaa8f to bb7a836 Compare October 14, 2025 00:56

GregoryComer merged commit 7421646 into pytorch:main Oct 14, 2025
131 of 137 checks passed

pytorchbot mentioned this pull request Oct 14, 2025

[v1.0.0] Release Tracker #14288

Open
