
Potential to weaken ordering for success case of compare_exchange in race module #220

Open
Imberflur opened this issue Feb 14, 2023 · 8 comments

Comments

@Imberflur
Contributor

For example, the compare exchange for OnceNonZeroUsize uses AcqRel for the success case:

self.inner.compare_exchange(0, val, Ordering::AcqRel, Ordering::Acquire);

The Acquire portion of the AcqRel here is for the load of 0; however, there is no Release store of 0, so this load will not synchronize with anything. Note that the construction of OnceNonZeroUsize still has a happens-before relationship with this, since we have an &self reference. Thus, I think a Release ordering is sufficient here for the desired synchronization to be achieved (including taking this documentation into account):

 self.inner.compare_exchange(0, val, Ordering::Release, Ordering::Acquire);

I believe this applies to all uses of compare_exchange in this module.


(happy to make a PR if this looks reasonable)
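To make the proposed change concrete, here is a minimal, self-contained sketch of the pattern under discussion (this is an illustration with made-up names, not once_cell's actual code): the success ordering only needs to publish `val` with `Release`, because the only value a successful CAS can load is the initial 0, which no thread ever stores with `Release`.

```rust
use std::num::NonZeroUsize;
use std::sync::atomic::{AtomicUsize, Ordering};

// Sketch of a OnceNonZeroUsize-like cell (illustrative, not the real code).
pub struct OnceNonZero {
    inner: AtomicUsize,
}

impl OnceNonZero {
    pub const fn new() -> Self {
        OnceNonZero { inner: AtomicUsize::new(0) }
    }

    /// Returns the stored value, initializing it to `val` if unset.
    pub fn get_or_set(&self, val: NonZeroUsize) -> NonZeroUsize {
        // Release on success publishes `val`; Acquire on failure
        // synchronizes with the winning thread's Release. Note that a
        // failure ordering stronger than the success ordering requires
        // Rust 1.64+ (see the comment below about rust-lang/rust#98383).
        match self.inner.compare_exchange(
            0,
            val.get(),
            Ordering::Release,
            Ordering::Acquire,
        ) {
            Ok(_) => val,
            // On failure the current value is nonzero (it differs from
            // the expected 0), so the unwrap cannot fail.
            Err(old) => NonZeroUsize::new(old).unwrap(),
        }
    }
}
```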

@matklad
Owner

matklad commented Feb 14, 2023

This does look reasonable to me, yeah! Although, as usual, I am unnerved by the fact that we don't have any tools to check this (or maybe I am missing them?).

Could you send a PR? You can also adjust the Changelog and bump the patch version in Cargo.toml, so that we release this together with #219

Imberflur added a commit to Imberflur/once_cell that referenced this issue Feb 14, 2023
Since the value loaded on compare-exchange success is `0`/`null` and no
`Release` store of this value exists, using an `Acquire` load for this
case adds no additional synchronization.

Fixes matklad#220
@taiki-e
Contributor

taiki-e commented Feb 14, 2023

Note that CAS failure ordering that is stronger than success ordering requires Rust 1.64+: rust-lang/rust#98383

@Imberflur
Contributor Author

Note that CAS failure ordering that is stronger than success ordering requires Rust 1.64+: rust-lang/rust#98383

Gah! 🙁

Although, as usual, I am unnerved by the fact that we don't have any tools to check this (or maybe I am missing them?).

The only thing that comes to my mind is testing with loom. I don't know if it is exactly what you are looking for though.
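loom would explore the interleavings exhaustively, which a plain test cannot. Purely as a rough illustration of what the Release/Acquire pairing is supposed to guarantee (all names here are invented for the sketch), a std::thread smoke test might look like:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// One thread writes DATA and then publishes via a Release CAS on FLAG
// (the role the success ordering plays in the race module); the other
// spins with Acquire loads and, once it sees the flag, must see DATA.
static DATA: AtomicUsize = AtomicUsize::new(0);
static FLAG: AtomicUsize = AtomicUsize::new(0);

fn check_publication() -> usize {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Publish: Release on success, Acquire on failure (Rust 1.64+).
        let _ = FLAG.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire);
    });
    // Spin until the flag is visible; Acquire synchronizes with the Release.
    while FLAG.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
    // The happens-before edge makes the write to DATA visible here.
    let seen = DATA.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}
```

A single passing run proves little, of course; that is exactly the gap loom is meant to fill.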

@briansmith
Contributor

Does anybody know how the codegen would change such that this would result in better performance for users of the race module?

@briansmith
Contributor

I believe it is good to raise the MSRV to 1.64 so we can do this. But it would be good to know what the real benefit would be.

@Imberflur
Contributor Author

FWIW I don't have any specific motivating case for this. I asked about this on the community discord and someone mentioned:

so in theory llvm can optimize it in some additional ways, but it doesn't in simple cases

on armv7 llvm avoids the trailing barrier on llsc success https://rust.godbolt.org/z/jEzaMTWar

Code from the godbolt link:

use std::sync::atomic::*;
pub static X: AtomicU32 = AtomicU32::new(0);
#[unsafe(no_mangle)]
fn foo1() -> bool {
    X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok()
}

#[unsafe(no_mangle)]
fn bar1() -> bool {
    X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok()
}

compiled for --target=armv7-unknown-linux-gnueabi

foo1:
        ldr     r1, .LCPI0_0
.LPC0_0:
        ldr     r1, [pc, r1]
        ldrex   r0, [r1]
        cmp     r0, #0
        bne     .LBB0_4
        dmb     ish
.LBB0_2:
        mov     r0, #1
        strex   r2, r0, [r1]
        cmp     r2, #0
        beq     .LBB0_5
        ldrex   r0, [r1]
        cmp     r0, #0
        beq     .LBB0_2
.LBB0_4:
        mov     r0, #0
        clrex
.LBB0_5:
        dmb     ish
        bx      lr
.LCPI0_0:
.Ltmp9:
        .long   example::X(GOT_PREL)-((.LPC0_0+8)-.Ltmp9)

bar1:
        ldr     r1, .LCPI1_0
.LPC1_0:
        ldr     r1, [pc, r1]
        ldrex   r0, [r1]
        cmp     r0, #0
        bne     .LBB1_4
        dmb     ish
.LBB1_2:
        mov     r0, #1
        strex   r2, r0, [r1]
        cmp     r2, #0
        bxeq    lr
        ldrex   r0, [r1]
        cmp     r0, #0
        beq     .LBB1_2
.LBB1_4:
        clrex

in particular, note the `bxeq lr` vs `beq .LBB0_5` (which is `dmb ish` + `bx lr`)

@briansmith
Contributor

Since the MSRV is 1.65 now, I submitted PR #278 to start this.

briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
See the analogous change in rust-lang/rust#131746 and
the discussion in matklad#220.

What is the effect of this change? Consider this example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
briansmith added a commit to briansmith/once_cell that referenced this issue Mar 15, 2025
…ease`.

See the analogous change in
rust-lang/rust#131746 and the discussion in
matklad#220.

What is the effect of this change? Not much, because before we ever
execute the `compare_exchange`, we do a load with `Ordering::Acquire`;
the `compare_exchange` is in the `#[cold]` path already.

Thus, this just mostly clarifies our expectations.

How does this change the code gen? Consider this analogous example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
@briansmith
Contributor

I updated PR #278 and I think it is now ready for review. But note that there is unlikely to be any significant benefit from doing this, since it is all happening in the cold path for users of get_or_init/get_or_try_init.
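The cold-path point can be sketched as follows (an assumed shape with invented names, not once_cell's exact code): callers normally hit only the Acquire load on the fast path, so the compare_exchange, and hence its success ordering, only runs in the uninitialized case.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative get_or_init shape: fast-path Acquire load, cold CAS.
pub struct OnceUsize {
    inner: AtomicUsize,
}

impl OnceUsize {
    pub const fn new() -> Self {
        OnceUsize { inner: AtomicUsize::new(0) }
    }

    pub fn get_or_init(&self, f: impl FnOnce() -> usize) -> usize {
        // Fast path: a single Acquire load; the CAS ordering is irrelevant here.
        let val = self.inner.load(Ordering::Acquire);
        if val != 0 {
            return val;
        }
        self.init(f)
    }

    #[cold]
    fn init(&self, f: impl FnOnce() -> usize) -> usize {
        let val = f();
        assert!(val != 0, "0 is the sentinel for 'unset' in this sketch");
        // Release publishes `val` on success; Acquire on failure
        // synchronizes with the thread that won the race.
        match self.inner.compare_exchange(0, val, Ordering::Release, Ordering::Acquire) {
            Ok(_) => val,
            Err(old) => old,
        }
    }
}
```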

briansmith added a commit to briansmith/once_cell that referenced this issue Mar 17, 2025
…ease`.

See the analogous change in
rust-lang/rust#131746 and the discussion in
matklad#220.

What is the effect of this change? Not much, because before we ever
execute the `compare_exchange`, we do a load with `Ordering::Acquire`;
the `compare_exchange` is in the `#[cold]` path already. Thus, this just mostly
clarifies our expectations. See the non-doc comment added under the module's doc
comment for the reasoning.

How does this change the code gen? Consider this analogous example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }

```

On x86_64, there is no change. Here is the generated code before and after:

```
foo1:
        mov     rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov     edx, 1
        xor     eax, eax
        lock            cmpxchg dword ptr [rcx], edx
        sete    al
        ret

example::X::h9e1b81da80078af7:
        .zero   4
```

On AArch64, regardless of whether atomics are outlined or not, there is no change. Here
is the generated code with inlined atomics:

```
foo1:
        adrp    x8, :got:example::X::h40b04fb69d714de3
        ldr     x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr   w9, [x8]
        cbnz    w9, .LBB0_4
        mov     w0, #1
        stlxr   w9, w0, [x8]
        cbnz    w9, .LBB0_1
        ret
.LBB0_4:
        mov     w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero   4
```

For 32-bit ARMv7, with inlined atomics, the resulting diff in the object code is:

```diff
@@ -10,14 +10,13 @@
         mov     r0, #1
         strex   r2, r0, [r1]
         cmp     r2, #0
-        beq     .LBB0_5
+        bxeq    lr
         ldrex   r0, [r1]
         cmp     r0, #0
         beq     .LBB0_2
 .LBB0_4:
-        mov     r0, #0
         clrex
-.LBB0_5:
+        mov     r0, #0
         dmb     ish
         bx      lr
 .LCPI0_0:
@@ -54,4 +53,3 @@

 example::X::h47e2038445e1c648:
         .zero   4
```
briansmith added two further commits to briansmith/once_cell that referenced this issue Mar 17, 2025