[NativeAOT-LLVM] Optimize PI and RPI transitions #3177

SingleAccretion · 2025-09-22T23:42:47Z

Depends on #3173.

Diffs (WasmDebugging, browser):

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 647755
Total bytes of diff: 646660
Total bytes of delta: -1095 (-0.17% % of base)
Average relative delta: -16.24%
    diff is an improvement
    average relative diff is an improvement

Top method regressions (percentages):
          21 (161.54% of base) : 1001.dasm - Thread::Destroy()
          45 (56.96% of base) : 1000.dasm - Thread::Construct()
          27 (20.77% of base) : 1006.dasm - RhpReversePInvoke
          15 (13.27% of base) : 1002.dasm - Thread::GcScanRoots(void (*)(Object**, ScanContext*, unsigned int), ScanContext*)
           7 ( 6.80% of base) : 1003.dasm - RhpWaitForGC2
           1 ( 2.63% of base) : 1009.dasm - RhpPInvokeReturn

Top methods only present in diff:
         211 (     ∞ of base) : 1053.dasm - RhpReversePInvokeAndPushSparseVirtualUnwindFrame
         184 (     ∞ of base) : 1052.dasm - Thread::ReversePInvokeAttachOrTrapThread_Wasm(unsigned long)
          33 (     ∞ of base) : 1054.dasm - RhpReversePInvokeReturnAndPopSparseVirtualUnwindFrame
          22 (     ∞ of base) : 1051.dasm - Thread::GetTransitionFrame()

Top method improvements (percentages):
         -57 (-78.08% of base) : 1024.dasm - S_P_CoreLib_System_Runtime_GCStress__Initialize
         -22 (-62.86% of base) : 1008.dasm - RhpPInvoke
         -45 (-55.56% of base) : 1047.dasm - S_P_CoreLib_System_Threading_Thread__LongSpinWait
         -45 (-55.56% of base) : 1033.dasm - S_P_CoreLib_Interop_Sys__LowLevelMonitor_Release
         -45 (-55.56% of base) : 1017.dasm - S_P_CoreLib_Interop_Sys__Free
         -45 (-52.94% of base) : 1016.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhGetGcTotalMemory
         -43 (-51.81% of base) : 1014.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhEndNoGCRegion
         -43 (-50.00% of base) : 1043.dasm - S_P_CoreLib_System_Threading_Thread__Yield
         -45 (-48.91% of base) : 1018.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhCollect
         -45 (-42.86% of base) : 1015.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhStartNoGCRegion
         -45 (-41.28% of base) : 1042.dasm - S_P_CoreLib_System_Runtime_InteropServices_NativeMemory__Alloc_0
         -43 (-40.19% of base) : 1037.dasm - S_P_CoreLib_Internal_Runtime_FrozenObjectHeapManager__ClrVirtualReserve
         -41 (-39.81% of base) : 1046.dasm - S_P_CoreLib_System_Buffer___ZeroMemory
         -45 (-37.50% of base) : 1013.dasm - S_P_CoreLib_Internal_Runtime_CompilerHelpers_StartupCodeHelpers__InitializeModuleFrozenObjectSegment
         -41 (-37.27% of base) : 1044.dasm - S_P_CoreLib_System_Buffer___Memmove
          -9 (-36.00% of base) : 1007.dasm - RhpReversePInvokeReturn
         -41 (-31.78% of base) : 1031.dasm - S_P_CoreLib_System_Threading_LowLevelMonitor__DisposeCore
         -41 (-28.47% of base) : 1038.dasm - S_P_CoreLib_Internal_Runtime_Augments_RuntimeAugments__InitializeStackTraceIpMap
         -51 (-24.76% of base) : 1027.dasm - S_P_CoreLib_System_Threading_LowLevelLock__SignalWaiter
         -43 (-24.57% of base) : 1035.dasm - S_P_CoreLib_System_Threading_LowLevelMonitor__Initialize

Top methods only present in base:
         -66 (-100.00% of base) : 1005.dasm - RhpGetOrInitShadowStackTop
         -15 (-100.00% of base) : 1004.dasm - RhpReversePInvokeAttachOrTrapThread2

55 total methods with Code Size differences (45 improved, 10 regressed)

Contributes to #3163.

This paves the way for introducing some RPI/PI optimizations and also moves us a bit closer to the model where all that's possible to do in IR is done in IR.

Setting up the shadow stack in IR is a tricky problem since we don't want to spill it to an alloca even in debug code. We therefore modify how we refer to the shadow stack:
1) For the main function, we use an untracked SSA local. Note how we are introducing a new concept here - usually all SSA locals are tracked.
2) For funclets, we repurpose the pre-existing PHYSREG node to allow us to refer to the LLVM argument directly.

However, the IR representation is just one part of the problem. Another is the fact that codegen needs to refer to the shadow stack directly, for debug info and helper call generation purposes. Since some of this needs to happen "in the prolog", before any IR is generated, we also introduce a concept of "late" and "early" prologs. They are sequenced as follows:
- "Early" prolog - codegen (LLVM IR)
- "Middle" prolog - LSSA (IR)
- "Late" prolog - codegen (LLVM IR)

This is almost a zero-diff change; the few diffs are due to the different ordering of native stack init and shadow stack init for RPI methods.

This changes the managed ABI in the following ways:
1) The PI transition frame becomes simply the shadow stack top. This "frame" is zero-sized - we don't store the current thread in it, since the WASM TLS model allows us to elide it. The obvious benefit from this is that the PI path is now almost 100% optimal: two stores and one load. We can get rid of the load in an ST build as well, but that's left for a future change.
2) The RPI transition frame is now always allocated at a zero offset and returned directly from the RPI helper. We thus elide the intermediate state where we already have the shadow stack, but haven't yet attached the thread. This brings us in line with other targets.
3) The sparse virtual unwind frame is now allocated right after the RPI frame, and "combined" RPI helpers are introduced that both effect the transition and push the EH frame.

The RPI changes reduce the number of helper calls that any RPI method needs to make from 3 to 1 (for epilogs - 2 to 1) in the sparse virtual unwinding model, and reduce the number of instructions on the critical path.

Benchmarks:

Node base:
Bench_PInvoke took : 86 ms (8.64 ns / op)
Bench_ReversePInvoke_Empty took : 113 ms (11.30 ns / op)
Bench_ReversePInvoke_WithEH took : 172 ms (17.24 ns / op)

Node diff:
Bench_PInvoke took : 81 ms (8.06 ns / op)
Bench_ReversePInvoke_Empty took : 58 ms (5.81 ns / op)
Bench_ReversePInvoke_WithEH took : 108 ms (10.79 ns / op)

Wasmtime base:
Bench_PInvoke took : 99 ms (9.86 ns / op)
Bench_ReversePInvoke_Empty took : 73 ms (7.28 ns / op)
Bench_ReversePInvoke_WithEH took : 77 ms (7.71 ns / op)

Wasmtime diff:
Bench_PInvoke took : 82 ms (8.16 ns / op)
Bench_ReversePInvoke_Empty took : 31 ms (3.06 ns / op)
Bench_ReversePInvoke_WithEH took : 50 ms (4.98 ns / op)
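To make the benchmark numbers easier to compare, here is a small sketch that turns the ns / op figures above into speedup ratios and percentage changes. The timings are copied from the listing; the names `RESULTS`, `speedup`, and `summarize` are illustrative only and are not part of the PR or the benchmark harness.

```python
# Per-operation timings in ns / op, as (base, diff), copied from the
# benchmark listing in the PR description. The structure is illustrative.
RESULTS = {
    "Node": {
        "Bench_PInvoke": (8.64, 8.06),
        "Bench_ReversePInvoke_Empty": (11.30, 5.81),
        "Bench_ReversePInvoke_WithEH": (17.24, 10.79),
    },
    "Wasmtime": {
        "Bench_PInvoke": (9.86, 8.16),
        "Bench_ReversePInvoke_Empty": (7.28, 3.06),
        "Bench_ReversePInvoke_WithEH": (7.71, 4.98),
    },
}


def speedup(base_ns, diff_ns):
    """How many times faster the diff build is than the base build."""
    return base_ns / diff_ns


def summarize(results):
    """Return (host, benchmark, speedup, percent change in time) rows."""
    rows = []
    for host, benches in results.items():
        for name, (base_ns, diff_ns) in benches.items():
            pct_change = (diff_ns - base_ns) / base_ns * 100.0
            rows.append((host, name, speedup(base_ns, diff_ns), pct_change))
    return rows


if __name__ == "__main__":
    for host, name, ratio, pct in summarize(RESULTS):
        print(f"{host:9} {name:28} {ratio:.2f}x ({pct:+.1f}% time per op)")
```

Computed this way, the empty reverse P/Invoke benchmark is roughly 1.9x faster on Node and roughly 2.4x faster on Wasmtime, while the P/Invoke benchmark improves by a smaller margin on both hosts.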

SingleAccretion force-pushed the PI-Opt-Abi branch 2 times, most recently from 106a1e1 to 00e0915, on September 23, 2025 19:54

SingleAccretion force-pushed the PI-Opt-Abi branch from 00e0915 to 61ca405 on September 23, 2025 21:39
