race: Relax compare_exchange success ordering from AcqRel to Release.

briansmith · briansmith · commit bb00466f2b01 · 2025-04-05T09:28:13.000-07:00
[cherry-pick once_cell 2a707eedb459369687ffdb2183ee82fabaa5d97a]

See the analogous change in rust-lang/rust#131746 and the discussion in matklad/once_cell#220.

What is the effect of this change? Not much. Before we ever execute the `compare_exchange`, we do a load with `Ordering::Acquire`, and the `compare_exchange` is already in the `#[cold]` path. This change therefore mostly clarifies our expectations. See the non-doc comment added under the module's doc comment for the reasoning.

How does this change the code gen? Consider this analogous example:

```diff
 #[no_mangle]
 fn foo1(y: &mut i32) -> bool {
-    let r = X.compare_exchange(0, 1, Ordering::AcqRel, Ordering::Acquire).is_ok();
+    let r = X.compare_exchange(0, 1, Ordering::Release, Ordering::Acquire).is_ok();
     r
 }
```

On x86_64, there is no change. Here is the generated code, which is identical before and after:

```
foo1:
        mov rcx, qword ptr [rip + example::X::h9e1b81da80078af7@GOTPCREL]
        mov edx, 1
        xor eax, eax
        lock cmpxchg dword ptr [rcx], edx
        sete al
        ret

example::X::h9e1b81da80078af7:
        .zero 4
```

On AArch64, there is no change, regardless of whether atomics are outlined. Here is the generated code with inlined atomics:

```
foo1:
        adrp x8, :got:example::X::h40b04fb69d714de3
        ldr x8, [x8, :got_lo12:example::X::h40b04fb69d714de3]
.LBB0_1:
        ldaxr w9, [x8]
        cbnz w9, .LBB0_4
        mov w0, #1
        stlxr w9, w0, [x8]
        cbnz w9, .LBB0_1
        ret
.LBB0_4:
        mov w0, wzr
        clrex
        ret

example::X::h40b04fb69d714de3:
        .zero 4
```

For 32-bit ARMv7 with inlined atomics, the resulting diff in the generated assembly is:

```diff
@@ -10,14 +10,13 @@
         mov r0, #1
         strex r2, r0, [r1]
         cmp r2, #0
-        beq .LBB0_5
+        bxeq lr
         ldrex r0, [r1]
         cmp r0, #0
         beq .LBB0_2
 .LBB0_4:
-        mov r0, #0
         clrex
-.LBB0_5:
+        mov r0, #0
         dmb ish
         bx lr
 .LCPI0_0:
@@ -54,4 +53,3 @@
 example::X::h47e2038445e1c648:
         .zero 4
```
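The pattern described above has two parts: an `Acquire` load on the fast path, and a `compare_exchange` confined to a `#[cold]` slow path. The following is a minimal sketch of that pattern using the new orderings. It assumes a hypothetical `RaceOnceUsize` type, loosely modeled on `OnceNonZeroUsize`; it is an illustration, not the polyfill's actual code.

```rust
use std::num::NonZeroUsize;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified stand-in for `OnceNonZeroUsize` (illustration only).
/// A stored value of 0 means "not yet initialized".
pub struct RaceOnceUsize {
    inner: AtomicUsize,
}

impl RaceOnceUsize {
    pub const fn new() -> Self {
        Self { inner: AtomicUsize::new(0) }
    }

    /// Fast path: this `Acquire` load synchronizes with the `Release`
    /// `compare_exchange` of whichever thread won the initialization race.
    pub fn get(&self) -> Option<NonZeroUsize> {
        NonZeroUsize::new(self.inner.load(Ordering::Acquire))
    }

    pub fn get_or_init(&self, f: impl FnOnce() -> NonZeroUsize) -> NonZeroUsize {
        match self.get() {
            Some(val) => val,
            None => self.init(f),
        }
    }

    /// Slow path, only reached when the fast-path load observed 0.
    #[cold]
    fn init(&self, f: impl FnOnce() -> NonZeroUsize) -> NonZeroUsize {
        let mut val = f().get();
        // Success: the previous value was 0, which is never stored with
        // `Release`, so there is nothing to acquire; `Release` publishes `val`.
        // Failure: another thread already stored a nonzero value with
        // `Release`; `Acquire` makes that store happen-before this read.
        let exchange = self
            .inner
            .compare_exchange(0, val, Ordering::Release, Ordering::Acquire);
        if let Err(old) = exchange {
            // Lost the race: adopt the winner's value.
            val = old;
        }
        // Either our value or the winner's; both are nonzero.
        NonZeroUsize::new(val).expect("stored values are always nonzero")
    }
}
```

Every caller returns the same value, including callers that lose the race. A loser simply discards the result of its own `f()`.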
diff --git a/src/polyfill/once_cell/race.rs b/src/polyfill/once_cell/race.rs
@@ -19,6 +19,14 @@
 //! `Acquire` and `Release` have very little performance overhead on most
 //! architectures versus `Relaxed`.
 
+// The "atomic orderings" section of the documentation above promises
+// "happens-before" semantics. This drives the choice of orderings in the uses
+// of `compare_exchange` below. On success, the value was zero/null, so there
+// was nothing to acquire (there is never any `Ordering::Release` store of 0).
+// On failure, the value was nonzero, so it was initialized previously (perhaps
+// on another thread) using `Ordering::Release`, so we must use
+// `Ordering::Acquire` to ensure that store "happens-before" this load.
+
 use core::sync::atomic;
 
 use atomic::{AtomicUsize, Ordering};
@@ -102,7 +110,7 @@ impl OnceNonZeroUsize {
         let mut val = f().get();
         let exchange = self
             .inner
-            .compare_exchange(0, val, Ordering::AcqRel, Ordering::Acquire);
+            .compare_exchange(0, val, Ordering::Release, Ordering::Acquire);
         if let Err(old) = exchange {
             val = old;
         }
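The "happens-before" reasoning in the new comment matters most when the atomic guards other data, as with a null-initialized pointer. The following is a hedged sketch that assumes a hypothetical `RaceBox<T>` type, which is not part of this diff. The winner's `Release` success ordering publishes the writes that built the boxed value. A loser's `Acquire` failure ordering makes those writes happen-before its reads through the returned pointer.

```rust
use std::marker::PhantomData;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// Hypothetical race-to-initialize box (illustration only).
/// A null pointer means "not yet initialized".
pub struct RaceBox<T> {
    inner: AtomicPtr<T>,
    // Tie auto traits and drop checking to the owned `Box<T>`.
    _owns: PhantomData<Box<T>>,
}

// Any thread may create the value and any other may read or drop it.
unsafe impl<T: Send + Sync> Sync for RaceBox<T> {}

impl<T> RaceBox<T> {
    pub const fn new() -> Self {
        Self { inner: AtomicPtr::new(ptr::null_mut()), _owns: PhantomData }
    }

    pub fn get_or_init(&self, f: impl FnOnce() -> T) -> &T {
        let mut p = self.inner.load(Ordering::Acquire);
        if p.is_null() {
            // Writes that build the value are sequenced before the
            // `Release` store below, so they are published with it.
            let new = Box::into_raw(Box::new(f()));
            match self.inner.compare_exchange(
                ptr::null_mut(),
                new,
                Ordering::Release, // success: old value was null, nothing to acquire
                Ordering::Acquire, // failure: synchronize with the winner's store
            ) {
                Ok(_) => p = new,
                Err(old) => {
                    // Lost the race: free our never-published allocation.
                    drop(unsafe { Box::from_raw(new) });
                    p = old;
                }
            }
        }
        // SAFETY: `p` is non-null, came from `Box::into_raw`, and is only
        // freed in `Drop`, which requires exclusive access.
        unsafe { &*p }
    }
}

impl<T> Drop for RaceBox<T> {
    fn drop(&mut self) {
        let p = *self.inner.get_mut();
        if !p.is_null() {
            drop(unsafe { Box::from_raw(p) });
        }
    }
}
```

Unlike the `usize` case, a loser here must free its own unpublished allocation before adopting the winner's pointer.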