Skip to content

Conversation

GregoryComer
Copy link
Member

@GregoryComer GregoryComer commented Oct 11, 2025

Summary

We're seeing crashes on Android when running XNNPACK-delegated models. I tracked it down to a bug in the alignment calculation for weight cache memory. To make the calculation, it casts the void* to a (signed) intptr_t. When the address is in the upper half of the address space, it becomes negative. This causes the modulo to return a negative value and increment the address too much - leading to out of bounds access.

void* maybe_aligned_space = data_container.data();
void* aligned_space = (void*)((intptr_t)maybe_aligned_space + 64 -
(intptr_t)maybe_aligned_space % 64);

Walking through the numbers I captured in #14831:

  • The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
  • The target alignment is 64 bytes.
  • Casting the address to intptr_t gives -5476376639047992688.
    • Mod 64 is -48.
    • The total offset applied is 64 - (-48) = 112.
  • Since the allocation size is N + 64, increasing the start by 112 means the new region extends 48 bytes past the end of the allocation.

To resolve this, I replaced the alignment code with a call to std::align. Casing to uintptr_t also resolves it, but using the standard implementation seems less error prone.

Test plan

I've validated that the repro in #14831 does not crash with this change.

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15039

Note: Links to docs will display an error until the docs builds have been completed.

❗ 2 Active SEVs

There are 2 currently active SEVs. If your PR is affected, please view them below:

❌ 5 New Failures, 10 Pending

As of commit bb7a836 with merge base 019c8da (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 11, 2025
@GregoryComer GregoryComer force-pushed the fix-weight-cache-align branch from d3e49c7 to 15097e3 Compare October 11, 2025 22:38
@GregoryComer
Copy link
Member Author

CI failures are due to running on a fork or broken trunk.

@GregoryComer GregoryComer marked this pull request as ready for review October 11, 2025 23:49
@GregoryComer GregoryComer added the release notes: none Do not include this in the release notes label Oct 12, 2025
@mergennachin
Copy link
Contributor

This is great, thank you @GregoryComer

Is it possible to write a regression test?

@mergennachin mergennachin self-requested a review October 12, 2025 18:27
@GregoryComer GregoryComer force-pushed the fix-weight-cache-align branch 2 times, most recently from b8d7fb0 to 52e24e1 Compare October 13, 2025 05:47
context->packed_pointer_to_container_[aligned_space] =
std::move(data_container);
return aligned_space;
} catch (std::exception& e) {
Copy link
Contributor

@mergennachin mergennachin Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any particular exception type you're expecting? usually not a good practice to catch base exception, recommend specifying the exception type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to use try..catch in the first place?

Copy link
Member Author

@GregoryComer GregoryComer Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is here because Mergen requested asserting non-null output from std::align (it returns null on error if it can't align to the requested params). We also want non-fatal OOM handling. XNNPACK can gracefully handle null weight pointers (at least from a quick glance).

Since asserts are fatal, I needed to distinguish between null in the context of memory allocation failed vs null due to std::align failing. String wont return null data, but since I'm adding the assert, I want to make it robust for if we swap to using the ET allocator later. String can throw both bad_alloc and length_error, which both should be treated as an allocation failure. I'm not entirely sure why the code is using a string over a unique_ptr<uint8_t[]> or similar, but I wanted to keep the changes minimal. I'm open to suggestions.

@digantdesai
Copy link
Contributor

it casts the void* to a (signed) intptr_t.

sigh

Copy link
Contributor

@digantdesai digantdesai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stamping to unblock, let's cp in 1.0

@GregoryComer
Copy link
Member Author

Is it possible to write a regression test?

This is a bit tricky as reproing the issue as is requires the allocation to be in the upper half of the address space. We do cover this code in CI in a number of places, but not on the right platforms. We could refactor it a bunch to allow a mock allocator in, but this is largely just testing the alignment logic. Since it's using std::align now, I don't necessarily see a ton of value in that.

I think the best takeaway would be to run more end to end model tests on Android (maybe via emulator?) in CI. That likely would have caught the bug, given that we see it pretty reliably on the Android demo app.

@GregoryComer GregoryComer force-pushed the fix-weight-cache-align branch from 085298e to 8e91a58 Compare October 13, 2025 21:17
@GregoryComer
Copy link
Member Author

I'm re-testing locally and in CI after rebasing and minor cleanup. Hitting some seemingly unrelated errors with class loading of the Android LLM extension. Will land once I figure this out if everything looks good. Hopefully within the next hour or two.

@GregoryComer GregoryComer force-pushed the fix-weight-cache-align branch from 8e91a58 to ddcaa8f Compare October 13, 2025 22:49
@kirklandsign
Copy link
Contributor

Hi @GregoryComer what is the android error? Is it consistent?

@GregoryComer
Copy link
Member Author

GregoryComer commented Oct 13, 2025

Hi @GregoryComer what is the android error? Is it consistent?

Here's what I'm seeing. Any ideas?

Abort message: 'JNI DETECTED ERROR IN APPLICATION: JNI NewGlobalRef called with pending exception java.lang.NoSuchMethodError: no static or non-static method "Lorg/pytorch/executorch/extension/llm/LlmModule;.initHybrid(ILjava/lang/String;Ljava/lang/String;FLjava/lang/Object;)Lcom/facebook/jni/HybridData;"

This started after I rebased and switched branches, as well as closed and re-setup my terminal env, so I'm guessing it's on my end. CI seems to pass. I'm testing on the base commit now.

@kirklandsign
Copy link
Contributor

Hi @GregoryComer what is the android error? Is it consistent?

Here's what I'm seeing. Any ideas?

Abort message: 'JNI DETECTED ERROR IN APPLICATION: JNI NewGlobalRef called with pending exception java.lang.NoSuchMethodError: no static or non-static method "Lorg/pytorch/executorch/extension/llm/LlmModule;.initHybrid(ILjava/lang/String;Ljava/lang/String;FLjava/lang/Object;)Lcom/facebook/jni/HybridData;"

This started after I rebased and switched branches, as well as closed and re-setup my terminal env, so I'm guessing it's on my end. CI seems to pass. I'm testing on the base commit now.

Does #15067 fix the issue?

@GregoryComer GregoryComer force-pushed the fix-weight-cache-align branch from ddcaa8f to bb7a836 Compare October 14, 2025 00:56
@GregoryComer
Copy link
Member Author

GregoryComer commented Oct 14, 2025

Hi @GregoryComer what is the android error? Is it consistent?

Here's what I'm seeing. Any ideas?

Abort message: 'JNI DETECTED ERROR IN APPLICATION: JNI NewGlobalRef called with pending exception java.lang.NoSuchMethodError: no static or non-static method "Lorg/pytorch/executorch/extension/llm/LlmModule;.initHybrid(ILjava/lang/String;Ljava/lang/String;FLjava/lang/Object;)Lcom/facebook/jni/HybridData;"

This started after I rebased and switched branches, as well as closed and re-setup my terminal env, so I'm guessing it's on my end. CI seems to pass. I'm testing on the base commit now.

Does #15067 fix the issue?

I'm still seeing it after rebasing onto the latest main. Since CI is clean, I'm hopeful that it's a local issue. I'm going to merge, pick, and build RC3 and validate there.

@GregoryComer GregoryComer merged commit 7421646 into pytorch:main Oct 14, 2025
131 of 137 checks passed
@GregoryComer
Copy link
Member Author

@pytorchbot cherrypick --onto release/1.0 -c critical

@pytorch-bot
Copy link

pytorch-bot bot commented Oct 14, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'cherrypick' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@GregoryComer
Copy link
Member Author

@pytorchbot cherry-pick --onto release/1.0 -c critical

pytorchbot pushed a commit that referenced this pull request Oct 14, 2025
### Summary
We're seeing crashes on Android when running XNNPACK-delegated models. I
tracked it down to a bug in the alignment calculation for weight cache
memory. To make the calculation, it casts the void* to a (signed)
intptr_t. When the address is in the upper half of the address space, it
becomes negative. This causes the modulo to return a negative value and
increment the address too much - leading to out of bounds access.

https://github.com/pytorch/executorch/blob/cc6cb837d6ac92f52a2d30a405900caf115f0556/backends/xnnpack/runtime/XNNWeightsCache.cpp#L166-L168

Walking through the numbers I captured in
#14831:
* The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
* The target alignment is 64 bytes.
* Casting the address to intptr_t gives -5476376639047992688.
  * Mod 64 is -48.
  * The total offset applied  is 64 - (-48) = 112.
* Since the allocation size is N + 64, increasing the start by 112 means
the new region extends 48 bytes past the end of the allocation.

To resolve this, I replaced the alignment code with a call to
std::align. Casing to uintptr_t also resolves it, but using the standard
implementation seems less error prone.

### Test plan
I've validated that the repro in
#14831 does not crash with
this change.

(cherry picked from commit 7421646)
@pytorchbot
Copy link
Collaborator

Cherry picking #15039

The cherry pick PR is at #15090 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

Gasoonjia pushed a commit that referenced this pull request Oct 14, 2025
### Summary
We're seeing crashes on Android when running XNNPACK-delegated models. I
tracked it down to a bug in the alignment calculation for weight cache
memory. To make the calculation, it casts the void* to a (signed)
intptr_t. When the address is in the upper half of the address space, it
becomes negative. This causes the modulo to return a negative value and
increment the address too much - leading to out of bounds access.


https://github.com/pytorch/executorch/blob/cc6cb837d6ac92f52a2d30a405900caf115f0556/backends/xnnpack/runtime/XNNWeightsCache.cpp#L166-L168

Walking through the numbers I captured in
#14831:
* The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
* The target alignment is 64 bytes.
* Casting the address to intptr_t gives -5476376639047992688.
  * Mod 64 is -48.
  * The total offset applied  is 64 - (-48) = 112.
* Since the allocation size is N + 64, increasing the start by 112 means
the new region extends 48 bytes past the end of the allocation.

To resolve this, I replaced the alignment code with a call to
std::align. Casing to uintptr_t also resolves it, but using the standard
implementation seems less error prone.

### Test plan
I've validated that the repro in
#14831 does not crash with
this change.
@kirklandsign
Copy link
Contributor

Hi @GregoryComer what device did you validate?

@GregoryComer
Copy link
Member Author

GregoryComer commented Oct 14, 2025

Hi @GregoryComer what device did you validate?

I validated on an emulated Pixel 9 from an M1 host.

Edit: I tried the RC3 AAR on the same emulator instance and it works - no segfault.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. release notes: none Do not include this in the release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants