Fix alignment calculation in XNNWeightsCache #15039
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15039. Note: links to docs will display an error until the docs builds have been completed.
❗ 2 Active SEVs: there are 2 currently active SEVs. If your PR is affected, please view them below.
❌ 5 New Failures, 10 Pending as of commit bb7a836 with merge base 019c8da.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from d3e49c7 to 15097e3.
CI failures are due to running on a fork or broken trunk.
This is great, thank you @GregoryComer! Is it possible to write a regression test?
Force-pushed from b8d7fb0 to 52e24e1.
context->packed_pointer_to_container_[aligned_space] =
    std::move(data_container);
return aligned_space;
} catch (std::exception& e) {
Is there any particular exception type you're expecting? It's usually not good practice to catch the base exception; I'd recommend specifying the exception type.
Why do we need to use try..catch in the first place?
This is here because Mergen requested asserting non-null output from std::align (it returns null if it can't satisfy the requested alignment within the available space). We also want non-fatal OOM handling, and XNNPACK can gracefully handle null weight pointers (at least from a quick glance).
Since asserts are fatal, I needed to distinguish between a null that means the memory allocation failed and a null that means std::align failed. std::string won't return null data, but since I'm adding the assert, I want it to stay robust in case we swap to the ET allocator later. std::string can throw both bad_alloc and length_error, and both should be treated as an allocation failure. I'm not entirely sure why the code uses a string rather than a unique_ptr<uint8_t[]> or similar, but I wanted to keep the changes minimal. I'm open to suggestions.
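For illustration, a minimal sketch of the pattern being described, assuming a std::string-backed buffer; the function and map names here are hypothetical stand-ins, not the actual XNNWeightsCache members:

```cpp
#include <cstddef>
#include <memory>         // std::align
#include <string>
#include <unordered_map>

// Hypothetical sketch: allocate size + alignment bytes in a std::string,
// align the start with std::align, and treat both allocation failures
// (bad_alloc / length_error from the string) and a null result from
// std::align as recoverable errors instead of fatal asserts.
void* allocate_aligned(
    std::unordered_map<void*, std::string>& packed_pointer_to_container,
    std::size_t size,
    std::size_t alignment) {
  try {
    std::string data_container;
    data_container.resize(size + alignment);  // may throw bad_alloc/length_error

    void* ptr = data_container.data();
    std::size_t space = data_container.size();
    void* aligned_space = std::align(alignment, size, ptr, space);
    if (aligned_space == nullptr) {
      return nullptr;  // std::align could not satisfy the request; non-fatal
    }
    // Keep the backing storage alive as long as the aligned pointer is in use.
    // Weight buffers are large enough that the string's storage is on the
    // heap, so moving it does not invalidate aligned_space.
    packed_pointer_to_container[aligned_space] = std::move(data_container);
    return aligned_space;
  } catch (const std::exception&) {
    return nullptr;  // treat bad_alloc / length_error as an allocation failure
  }
}
```

With space equal to size + alignment and a power-of-two alignment, std::align always succeeds, so the null check mostly guards against the buffer sizing changing later (e.g. a swap to the ET allocator).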
sigh
stamping to unblock, let's cp in 1.0
This is a bit tricky, as reproing the issue as-is requires the allocation to land in the upper half of the address space. We do cover this code in CI in a number of places, but not on the right platforms. We could refactor it to allow a mock allocator, but that would largely just be testing the alignment logic, and since it now uses std::align, I don't see a ton of value in that. I think the better takeaway is to run more end-to-end model tests on Android (maybe via an emulator?) in CI. That likely would have caught the bug, given that we see it pretty reliably in the Android demo app.
Force-pushed from 52e24e1 to 085298e.
Force-pushed from 085298e to 8e91a58.
I'm re-testing locally and in CI after rebasing and some minor cleanup. I'm hitting some seemingly unrelated errors with class loading in the Android LLM extension. I'll land once I figure this out, assuming everything looks good, hopefully within the next hour or two.
Force-pushed from 8e91a58 to ddcaa8f.
Hi @GregoryComer, what is the Android error? Is it consistent?
Here's what I'm seeing. Any ideas?
This started after I rebased and switched branches, as well as closed and re-set up my terminal environment, so I'm guessing it's on my end. CI seems to pass. I'm testing on the base commit now.
Does #15067 fix the issue?
Force-pushed from ddcaa8f to bb7a836.
I'm still seeing it after rebasing onto the latest main. Since CI is clean, I'm hopeful that it's a local issue. I'm going to merge, cherry-pick, and build RC3 to validate there.
@pytorchbot cherrypick --onto release/1.0 -c critical
❌ 🤖 pytorchbot command failed:
Try
@pytorchbot cherry-pick --onto release/1.0 -c critical
(Cherry-pick PR description: same Summary and Test plan as this PR. Cherry picked from commit 7421646.)
Cherry picking #15039: The cherry-pick PR is at #15090, and it is recommended to link a critical cherry-pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team: Raised by workflow job
Hi @GregoryComer, what device did you validate on?
I validated on an emulated Pixel 9 from an M1 host. Edit: I tried the RC3 AAR on the same emulator instance and it works, with no segfault.
Summary
We're seeing crashes on Android when running XNNPACK-delegated models. I tracked it down to a bug in the alignment calculation for weight cache memory. To make the calculation, it casts the void* to a (signed) intptr_t. When the address is in the upper half of the address space, it becomes negative. This causes the modulo to return a negative value and increment the address too much, leading to out-of-bounds access.
https://github.com/pytorch/executorch/blob/cc6cb837d6ac92f52a2d30a405900caf115f0556/backends/xnnpack/runtime/XNNWeightsCache.cpp#L166-L168 (lines 166 to 168 in cc6cb83)
Walking through the numbers I captured in #14831:
* The raw (unaligned) address of the data buffer is 0xb40000763d4bfa90.
* The target alignment is 64 bytes.
* Casting the address to intptr_t gives -5476376639047992688.
* Mod 64 is -48.
* The total offset applied is 64 - (-48) = 112.
* Since the allocation size is N + 64, increasing the start by 112 means the new region extends 48 bytes past the end of the allocation.
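For concreteness, a small self-contained snippet that reproduces this arithmetic. The address is hard-coded from the repro; this is illustrative only, not the actual cache code, and the unsigned branch just shows the uintptr_t-style alternative mentioned below:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // The unaligned buffer address captured in the Android repro (#14831).
  const uint64_t raw = 0xb40000763d4bfa90ULL;

  // Buggy path: reinterpret the address as a signed value, as the intptr_t
  // cast did. The high bit is set, so the value wraps negative (two's
  // complement; well-defined since C++20, in practice on earlier standards).
  const int64_t signed_addr = static_cast<int64_t>(raw);  // -5476376639047992688
  const int64_t signed_rem = signed_addr % 64;             // -48
  const int64_t buggy_offset = 64 - signed_rem;            // 112: 48 bytes past the
                                                           // 64 bytes of slack in the
                                                           // N + 64 allocation

  // Unsigned alternative: the remainder is non-negative, so the offset stays
  // within the slack.
  const uint64_t unsigned_rem = raw % 64;            // 16
  const uint64_t offset = (64 - unsigned_rem) % 64;  // 48

  std::printf("signed:   rem=%lld offset=%lld\n",
              static_cast<long long>(signed_rem),
              static_cast<long long>(buggy_offset));
  std::printf("unsigned: rem=%llu offset=%llu\n",
              static_cast<unsigned long long>(unsigned_rem),
              static_cast<unsigned long long>(offset));
  return 0;
}
```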
To resolve this, I replaced the alignment code with a call to std::align. Casting to uintptr_t also resolves it, but using the standard implementation seems less error-prone.
Test plan
I've validated that the repro in #14831 does not crash with this change.