@SingleAccretion commented Sep 22, 2025

Depends on #3173.

Diffs (WasmDebugging, browser):

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 647755
Total bytes of diff: 646660
Total bytes of delta: -1095 (-0.17% of base)
Average relative delta: -16.24%
    diff is an improvement
    average relative diff is an improvement

Top method regressions (percentages):
          21 (161.54% of base) : 1001.dasm - Thread::Destroy()
          45 (56.96% of base) : 1000.dasm - Thread::Construct()
          27 (20.77% of base) : 1006.dasm - RhpReversePInvoke
          15 (13.27% of base) : 1002.dasm - Thread::GcScanRoots(void (*)(Object**, ScanContext*, unsigned int), ScanContext*)
           7 ( 6.80% of base) : 1003.dasm - RhpWaitForGC2
           1 ( 2.63% of base) : 1009.dasm - RhpPInvokeReturn

Top methods only present in diff:
         211 (     ∞ of base) : 1053.dasm - RhpReversePInvokeAndPushSparseVirtualUnwindFrame
         184 (     ∞ of base) : 1052.dasm - Thread::ReversePInvokeAttachOrTrapThread_Wasm(unsigned long)
          33 (     ∞ of base) : 1054.dasm - RhpReversePInvokeReturnAndPopSparseVirtualUnwindFrame
          22 (     ∞ of base) : 1051.dasm - Thread::GetTransitionFrame()

Top method improvements (percentages):
         -57 (-78.08% of base) : 1024.dasm - S_P_CoreLib_System_Runtime_GCStress__Initialize
         -22 (-62.86% of base) : 1008.dasm - RhpPInvoke
         -45 (-55.56% of base) : 1047.dasm - S_P_CoreLib_System_Threading_Thread__LongSpinWait
         -45 (-55.56% of base) : 1033.dasm - S_P_CoreLib_Interop_Sys__LowLevelMonitor_Release
         -45 (-55.56% of base) : 1017.dasm - S_P_CoreLib_Interop_Sys__Free
         -45 (-52.94% of base) : 1016.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhGetGcTotalMemory
         -43 (-51.81% of base) : 1014.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhEndNoGCRegion
         -43 (-50.00% of base) : 1043.dasm - S_P_CoreLib_System_Threading_Thread__Yield
         -45 (-48.91% of base) : 1018.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhCollect
         -45 (-42.86% of base) : 1015.dasm - S_P_CoreLib_System_Runtime_InternalCalls__RhStartNoGCRegion
         -45 (-41.28% of base) : 1042.dasm - S_P_CoreLib_System_Runtime_InteropServices_NativeMemory__Alloc_0
         -43 (-40.19% of base) : 1037.dasm - S_P_CoreLib_Internal_Runtime_FrozenObjectHeapManager__ClrVirtualReserve
         -41 (-39.81% of base) : 1046.dasm - S_P_CoreLib_System_Buffer___ZeroMemory
         -45 (-37.50% of base) : 1013.dasm - S_P_CoreLib_Internal_Runtime_CompilerHelpers_StartupCodeHelpers__InitializeModuleFrozenObjectSegment
         -41 (-37.27% of base) : 1044.dasm - S_P_CoreLib_System_Buffer___Memmove
          -9 (-36.00% of base) : 1007.dasm - RhpReversePInvokeReturn
         -41 (-31.78% of base) : 1031.dasm - S_P_CoreLib_System_Threading_LowLevelMonitor__DisposeCore
         -41 (-28.47% of base) : 1038.dasm - S_P_CoreLib_Internal_Runtime_Augments_RuntimeAugments__InitializeStackTraceIpMap
         -51 (-24.76% of base) : 1027.dasm - S_P_CoreLib_System_Threading_LowLevelLock__SignalWaiter
         -43 (-24.57% of base) : 1035.dasm - S_P_CoreLib_System_Threading_LowLevelMonitor__Initialize

Top methods only present in base:
         -66 (-100.00% of base) : 1005.dasm - RhpGetOrInitShadowStackTop
         -15 (-100.00% of base) : 1004.dasm - RhpReversePInvokeAttachOrTrapThread2

55 total methods with Code Size differences (45 improved, 10 regressed)
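
The headline percentages in the summary above can be reproduced from the raw byte counts. This is a minimal sketch, assuming the total delta is taken relative to the total base size and the average relative delta is an unweighted mean of per-method relative deltas. The second point is an assumption about the diff tool, not something the summary states. `MethodSize` and both function names are illustrative only.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: one method's code size in the base and diff builds.
struct MethodSize
{
    long baseBytes;
    long diffBytes;
};

// Total delta relative to the total base size, in percent.
double TotalRelativeDeltaPercent(long totalBase, long totalDiff)
{
    assert(totalBase > 0);
    return 100.0 * static_cast<double>(totalDiff - totalBase) / static_cast<double>(totalBase);
}

// Unweighted mean of the per-method relative deltas, in percent.
// Assumption: every method counts equally, regardless of its size.
double AverageRelativeDeltaPercent(const std::vector<MethodSize>& methods)
{
    assert(!methods.empty());
    double sum = 0.0;
    for (const MethodSize& m : methods)
    {
        assert(m.baseBytes > 0);
        sum += 100.0 * static_cast<double>(m.diffBytes - m.baseBytes) / static_cast<double>(m.baseBytes);
    }
    return sum / static_cast<double>(methods.size());
}
```

With the totals above, `TotalRelativeDeltaPercent(647755, 646660)` comes out at roughly -0.17. Under this reading, the much larger average relative delta is unsurprising: many small methods shrink by 40-55% while contributing only a few dozen bytes each to the total.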

Contributes to #3163.

This paves the way for introducing some P/Invoke (PI) and reverse P/Invoke (RPI)
optimizations and also moves us a bit closer to the model where everything that
can be done in IR is done in IR.

Setting up the shadow stack in IR is a tricky problem since we don't want to
spill it to an alloca even in debug code. We therefore modify how we refer
to the shadow stack:
1) For the main function, we use an untracked SSA local. Note how we are
   introducing a new concept here - usually all SSA locals are tracked.
2) For funclets, we repurpose the pre-existing PHYSREG node to allow us
   to refer to the LLVM argument directly.

However, the IR representation is just one part of the problem. Another is the
fact that codegen needs to refer to the shadow stack directly, for debug info and
helper call generation purposes. Since some of this needs to happen "in the prolog",
before any IR is generated, we also introduce the concepts of "early" and "late"
prologs. They are sequenced as follows:

 - "Early"  prolog - codegen (LLVM IR)
 - "Middle" prolog - LSSA (IR)
 - "Late"   prolog - codegen (LLVM IR)

This is almost a zero-diff change; the few diffs are due to the different
ordering of native stack init and shadow stack init for RPI methods.
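
The early/middle/late prolog sequencing described above can be pictured as a strictly ordered three-step pipeline. The sketch below is only an illustration of that ordering: `PrologPhase`, `PhaseEmitter`, and `PrologSequencer` are hypothetical names and do not correspond to types in the compiler.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The three prolog phases, in emission order.
enum class PrologPhase { Early = 0, Middle = 1, Late = 2 };

// Which component emits each phase, following the list in the description.
inline std::string PhaseEmitter(PrologPhase phase)
{
    switch (phase)
    {
        case PrologPhase::Early:  return "codegen (LLVM IR)";
        case PrologPhase::Middle: return "LSSA (IR)";
        case PrologPhase::Late:   return "codegen (LLVM IR)";
    }
    return "unknown";
}

// Hypothetical driver: records phases as they run and rejects
// out-of-order emission.
class PrologSequencer
{
public:
    void Emit(PrologPhase phase)
    {
        // Each phase must directly follow the previous one.
        assert(static_cast<std::size_t>(phase) == m_emitted.size());
        m_emitted.push_back(phase);
    }

    bool IsComplete() const { return m_emitted.size() == 3; }

private:
    std::vector<PrologPhase> m_emitted;
};
```

The point of the split, as described, is that codegen gets a chance to run both before and after the IR-level prolog, so it can refer to the shadow stack without the IR having to spill it.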
@SingleAccretion force-pushed the PI-Opt-Abi branch 2 times, most recently from 106a1e1 to 00e0915 on September 23, 2025 19:54

This changes the managed ABI in the following ways:

1) The PI transition frame becomes simply the shadow stack top. This
   "frame" is zero-sized - we don't store the current thread in it,
   since the WASM TLS model allows us to elide it.

   The obvious benefit from this is that the PI path is now almost
   100% optimal: two stores and one load. We can get rid of the load
   in an ST build as well, but that's left for a future change.
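
   A rough model of what a zero-sized PI "frame" means in practice is sketched below. The text does not spell out which two stores and which load make up the PI path, so the split shown here is an assumption, and `ThreadState`, `t_thread`, `PInvokeEnter`, and `PInvokeReturn` are hypothetical names rather than runtime APIs.

```cpp
#include <cassert>

// Hypothetical per-thread runtime state; the field names are illustrative.
struct ThreadState
{
    void* transitionFrame = nullptr; // Non-null while the thread runs native code.
    void* shadowStackTop = nullptr;  // Saved top of the shadow stack.
};

// Stand-in for the runtime's thread-local thread object.
thread_local ThreadState t_thread;

// Entering native code. The "frame" is just the shadow stack top, so
// nothing is written into the frame itself (it is zero-sized).
inline void PInvokeEnter(void* shadowStackTop)
{
    assert(shadowStackTop != nullptr);
    ThreadState* thread = &t_thread;          // Assumed: the one load (locating TLS).
    thread->shadowStackTop = shadowStackTop;  // Assumed: store 1.
    thread->transitionFrame = shadowStackTop; // Assumed: store 2.
}

// Returning to managed code clears the transition frame.
inline void PInvokeReturn()
{
    t_thread.transitionFrame = nullptr;
}
```

   In this model the current thread never needs to be recorded in the frame, because it can always be recovered from thread-local storage.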

2) The RPI transition frame is now always allocated at a zero offset
   and returned directly from the RPI helper. We thus elide the intermediate
   state where we already have the shadow stack, but haven't yet attached
   the thread. This brings us in line with other targets.

3) The sparse virtual unwind frame is now allocated right after
   the RPI frame, and "combined" RPI helpers are introduced that both
   effect the transition and push the EH frame.
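
The layout implied by points 2 and 3 can be sketched as a small offset computation: the RPI frame sits at offset zero of the method's shadow frame and the sparse virtual unwind frame follows it directly. The frame sizes, the alignment parameter, and all names below are assumptions made for illustration. The actual sizes are not given in this description.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of where things land in an RPI method's shadow frame.
struct RpiShadowFrameLayout
{
    std::size_t rpiFrameOffset;    // Always zero in the new ABI.
    std::size_t unwindFrameOffset; // Directly after the RPI frame.
    std::size_t localsOffset;      // First offset available to everything else.
};

// Rounds value up to the next multiple of alignment.
constexpr std::size_t AlignUp(std::size_t value, std::size_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

// Sizes and alignment are caller-supplied placeholders, not real runtime values.
inline RpiShadowFrameLayout ComputeLayout(std::size_t rpiFrameSize,
                                          std::size_t unwindFrameSize,
                                          std::size_t alignment)
{
    assert(alignment != 0);
    RpiShadowFrameLayout layout{};
    layout.rpiFrameOffset = 0;
    layout.unwindFrameOffset = AlignUp(rpiFrameSize, alignment);
    layout.localsOffset = AlignUp(layout.unwindFrameOffset + unwindFrameSize, alignment);
    return layout;
}
```

Because both frames sit at fixed offsets from the shadow stack top, a single combined helper can initialize them in one call, which is what makes the reduction in helper calls possible.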

The RPI changes reduce the number of helper calls that any RPI method
needs to make from 3 to 1 (for epilogs - 2 to 1) in the sparse virtual
unwinding model, and reduce the number of instructions on the critical
path.

Benchmarks:
  Node base:
    Bench_PInvoke took               : 86 ms (8.64 ns / op)
    Bench_ReversePInvoke_Empty took  : 113 ms (11.30 ns / op)
    Bench_ReversePInvoke_WithEH took : 172 ms (17.24 ns / op)

  Node diff:
    Bench_PInvoke took               : 81 ms (8.06 ns / op)
    Bench_ReversePInvoke_Empty took  : 58 ms (5.81 ns / op)
    Bench_ReversePInvoke_WithEH took : 108 ms (10.79 ns / op)

  Wasmtime base:
    Bench_PInvoke took               : 99 ms (9.86 ns / op)
    Bench_ReversePInvoke_Empty took  : 73 ms (7.28 ns / op)
    Bench_ReversePInvoke_WithEH took : 77 ms (7.71 ns / op)

  Wasmtime diff:
    Bench_PInvoke took               : 82 ms (8.16 ns / op)
    Bench_ReversePInvoke_Empty took  : 31 ms (3.06 ns / op)
    Bench_ReversePInvoke_WithEH took : 50 ms (4.98 ns / op)
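
The per-operation timings above are easier to compare as ratios. A minimal sketch, with illustrative function names, that turns a base/diff pair into a speedup factor and back-computes the approximate number of operations per run (the iteration count is not stated in the benchmark output, so that figure is only an estimate):

```cpp
#include <cassert>

// Speedup of diff over base: values above 1.0 mean the diff is faster.
inline double Speedup(double baseNsPerOp, double diffNsPerOp)
{
    assert(baseNsPerOp > 0.0 && diffNsPerOp > 0.0);
    return baseNsPerOp / diffNsPerOp;
}

// Estimated operation count for a run, from total time and per-op time.
inline double ApproxIterations(double totalMs, double nsPerOp)
{
    assert(nsPerOp > 0.0);
    return totalMs * 1e6 / nsPerOp; // 1 ms = 1e6 ns
}
```

For `Bench_ReversePInvoke_Empty`, `Speedup(11.30, 5.81)` is roughly 1.9x on Node and `Speedup(7.28, 3.06)` is roughly 2.4x on Wasmtime. The totals are consistent with about ten million operations per run.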