
Potential to weaken ordering for success case of compare_exchange in race module #220

Open
Imberflur opened this issue Feb 14, 2023 · 8 comments

Comments

@Imberflur
Contributor

For example, the compare exchange for OnceNonZeroUsize uses AcqRel for the success case:

self.inner.compare_exchange(0, val, Ordering::AcqRel, Ordering::Acquire);

The Acquire portion of the AcqRel here is for the load of 0; however, there is no Release store of 0, so this load will not synchronize with anything. Note that the construction of OnceNonZeroUsize still has a happens-before relationship with this, since we have an &self reference. Thus, I think a Release ordering is sufficient here for the desired synchronization to be achieved (including taking this documentation into account):

 self.inner.compare_exchange(0, val, Ordering::Release, Ordering::Acquire);

I believe this applies to all uses of compare_exchange in this module.


(happy to make a PR if this looks reasonable)
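To make the proposed change concrete, here is a minimal, self-contained sketch of the pattern under discussion (this is an illustration with made-up names, not once_cell's actual code): the success ordering only needs to publish `val` with `Release`, because the only value a successful CAS can load is the initial 0, which no thread ever stores with `Release`.

```rust
use std::num::NonZeroUsize;
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of a OnceNonZeroUsize-like cell (illustrative, not the real code).
pub struct OnceNonZero {
    inner: AtomicUsize,
}

impl OnceNonZero {
    pub const fn new() -> Self {
        OnceNonZero { inner: AtomicUsize::new(0) }
    }

    /// Returns the stored value, initializing it to `val` if unset.
    pub fn get_or_set(&self, val: NonZeroUsize) -> NonZeroUsize {
        // Release on success publishes `val`; Acquire on failure
        // synchronizes with the winning thread's Release. Note that a
        // failure ordering stronger than the success ordering requires
        // Rust 1.64+ (see the comment below about rust-lang/rust#98383).
        match self.inner.compare_exchange(
            0,
            val.get(),
            Ordering::Release,
            Ordering::Acquire,
        ) {
            Ok(_) => val,
            // On failure the current value is nonzero (it differs from
            // the expected 0), so the unwrap cannot fail.
            Err(old) => NonZeroUsize::new(old).unwrap(),
        }
    }
}
```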

@matklad
Owner

matklad commented Feb 14, 2023

This does look reasonable to me, yeah! Although, as usual, I am unnerved by the fact that we don't have any tools to check this (or maybe I am missing them?).

Could you send a PR? You can also adjust the Changelog and bump the patch version in Cargo.toml, so that we release this together with #219

Imberflur added a commit to Imberflur/once_cell that referenced this issue Feb 14, 2023
Since the value loaded on compare-exchange success is `0`/`null` and no
`Release` store of this value exists, using an `Acquire` load for this
case adds no additional synchronization.

Fixes matklad#220
@taiki-e
Contributor

taiki-e commented Feb 14, 2023

Note that CAS failure ordering that is stronger than success ordering requires Rust 1.64+: rust-lang/rust#98383

@Imberflur
Contributor Author

Note that CAS failure ordering that is stronger than success ordering requires Rust 1.64+: rust-lang/rust#98383

Gah! 🙁

Although, as usual, I am unnerved by the fact that we don't have any tools to check this (or maybe I am missing them?).

The only thing that comes to my mind is testing with loom. I don't know if it is exactly what you are looking for though.
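loom would explore the interleavings exhaustively, which a plain test cannot. Purely as a rough illustration of what the Release/Acquire pairing is supposed to guarantee (all names here are invented for the sketch), a std::thread smoke test might look like:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// One thread writes DATA and then publishes via a Release CAS on FLAG
// (the role the success ordering plays in the race module); the other
// spins with Acquire loads and, once it sees the flag, must see DATA.
static DATA: AtomicUsize = AtomicUsize::new(0);
static FLAG: AtomicUsize = AtomicUsize::new(0);

fn check_publication() -> usize {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Publish: Release on success, Acquire on failure (Rust 1.64+).
        let _ = FLAG.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire);
    });
    // Spin until the flag is visible; Acquire synchronizes with the Release.
    while FLAG.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
    // The happens-before edge makes the write to DATA visible here.
    let seen = DATA.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}
```

A single passing run proves little, of course; that is exactly the gap loom is meant to fill.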

@briansmith
Contributor

Does anybody know how the codegen would change such that this would result in better performance for users of the race module?

@briansmith
Contributor

I believe it is good to raise the MSRV to 1.64 so we can do this. But it would be good to know what the real benefit would be.

@Imberflur
Contributor Author

FWIW I don't have any specific motivating case for this. I asked about this on the community discord and someone mentioned:

so in theory llvm can optimize it in some additional ways, but it doesn't in simple cases

on armv7 llvm avoids the trailing barrier on llsc success https://rust.godbolt.org/z/jEzaMTWar

Code from the godbolt link:

use std::sync::atomic::*;
pub static X: AtomicU32 = AtomicU32::new(0);
#[unsafe(no_mangle)]
fn foo1() -> bool {
    X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok()
}

#[unsafe(no_mangle)]
fn bar1() -> bool {
    X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok()
}

compiled for --target=armv7-unknown-linux-gnueabi

foo1:
        ldr     r1, .LCPI0_0
.LPC0_0:
        ldr     r1, [pc, r1]
        ldrex   r0, [r1]
        cmp     r0, #0
        bne     .LBB0_4
        dmb     ish
.LBB0_2:
        mov     r0, #1
        strex   r2, r0, [r1]
        cmp     r2, #0
        beq     .LBB0_5
        ldrex   r0, [r1]
        cmp     r0, #0
        beq     .LBB0_2
.LBB0_4:
        mov     r0, #0
        clrex
.LBB0_5:
        dmb     ish
        bx      lr
.LCPI0_0:
.Ltmp9:
        .long   example::X(GOT_PREL)-((.LPC0_0+8)-.Ltmp9)

bar1:
        ldr     r1, .LCPI1_0
.LPC1_0:
        ldr     r1, [pc, r1]
        ldrex   r0, [r1]
        cmp     r0, #0
        bne     .LBB1_4
        dmb     ish
.LBB1_2:
        mov     r0, #1
        strex   r2, r0, [r1]
        cmp     r2, #0
        bxeq    lr
        ldrex   r0, [r1]
        cmp     r0, #0
        beq     .LBB1_2
.LBB1_4:
        clrex

in particular, note the `bxeq lr` vs `beq .LBB0_5` (which is `dmb ish` + `bx lr`)

@briansmith
Contributor

Since the MSRV is 1.65 now, I submitted PR #278 to start this.

briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
See the analogous change in rust-lang/rust#131746 and
the discussion in matklad#220.

What is the effect of this change? Consider this example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
…ease`.

See the analogous change in
rust-lang/rust#131746 and the discussion in
matklad#220.

What is the effect of this change? Not much, because before we ever
execute the `compare_exchange`, we do a load with `Ordering::Acquire`;
the `compare_exchange` is in the `#[cold]` path already.

Thus, this just mostly clarifies our expectations.

How does this change the code gen? Consider this analogous example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
@briansmith
Contributor

I updated PR #278 and I think it is now ready for review. But note that there is unlikely to be any significant benefit from doing this, since it is all happening in the cold path for users of get_or_init/get_or_try_init.
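The cold-path point can be sketched as follows (an assumed shape with invented names, not once_cell's exact code): callers normally hit only the Acquire load on the fast path, so the compare_exchange, and hence its success ordering, only runs in the uninitialized case.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative get_or_init shape: fast-path Acquire load, cold CAS.
pub struct OnceUsize {
    inner: AtomicUsize,
}

impl OnceUsize {
    pub const fn new() -> Self {
        OnceUsize { inner: AtomicUsize::new(0) }
    }

    pub fn get_or_init(&self, f: impl FnOnce() -> usize) -> usize {
        // Fast path: a single Acquire load; the CAS ordering is irrelevant here.
        let val = self.inner.load(Ordering::Acquire);
        if val != 0 {
            return val;
        }
        self.init(f)
    }

    #[cold]
    fn init(&self, f: impl FnOnce() -> usize) -> usize {
        let val = f();
        assert!(val != 0, "0 is the sentinel for 'unset' in this sketch");
        // Release publishes `val` on success; Acquire on failure
        // synchronizes with the thread that won the race.
        match self.inner.compare_exchange(0, val, Ordering::Release, Ordering::Acquire) {
            Ok(_) => val,
            Err(old) => old,
        }
    }
}
```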

briansmith added a commit to briansmith/once_cell that referenced this issue Mar 17, 2025
…ease`.

See the analogous change in
rust-lang/rust#131746 and the discussion in
matklad#220.

What is the effect of this change? Not much, because before we ever
execute the `compare_exchange`, we do a load with `Ordering::Acquire`;
the `compare_exchange` is in the `#[cold]` path already. Thus, this just mostly
clarifies our expectations. See the non-doc comment added under the module's doc
comment for the reasoning.

How does this change the code gen? Consider this analogous example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
briansmith added two further commits to briansmith/once_cell that referenced this issue Mar 17, 2025