
Can FFI code use memory_order_consume #528

Open
chorman0773 opened this issue Aug 17, 2024 · 9 comments


@chorman0773
Contributor

chorman0773 commented Aug 17, 2024

For atomics and our multithreaded memory model, we incorporate the C++ rules by reference in MiniRust's definition. We do not define Ordering::Consume, but I do not believe we exclude its existence. Would it thus be valid for FFI code (or assembly) to perform a consume operation that reads from a release operation, and would the consume behaviour (dependency-ordered before) follow into Rust code that executes after?

Formally, is the following code defined behaviour, given no other synchronization, where thread1 and thread2 are entrypoints for separate threads:

#[no_mangle]
pub static IMPORTANT_DATA: AtomicPtr<i32> = AtomicPtr::new(core::ptr::null_mut());

pub fn thread1() {
    IMPORTANT_DATA.store(Box::into_raw(Box::new(5i32)), Ordering::Release);
}

unsafe extern "C" {
    fn get_important_data() -> *mut i32;
}

pub fn thread2() {
    let ptr = unsafe { Box::from_raw(get_important_data()) };
    println!("{ptr}");
}
extern "C" {
    extern std::atomic<int32_t*> IMPORTANT_DATA;

    int32_t* get_important_data() {
        return IMPORTANT_DATA.load(std::memory_order::consume); // not acquire
        // assume this is done smartly in a loop that checks for null, and exchanges in `null`.
    }
}

(or equivalent C18 code).

@comex

comex commented Aug 18, 2024

Are there any C or C++ compilers that actually implement memory_order_consume as intended, i.e. by relying on architecture-provided dependency-ordering guarantees?

Does yours?

As far as I know, all of the major compilers found it too hard to preserve dependencies through the optimization process, gave up, and currently resort to treating memory_order_consume as a synonym for memory_order_acquire (i.e. emitting a memory barrier). And from what I've heard, this will probably never change; if there's a path forward for consume ordering it's probably through a more-limited replacement API like p0750.

Certainly this is true for Clang. And LLVM's inability to preserve dependencies also applies to rustc. Therefore, if the Rust part of your example is compiled using rustc with its default backend, it can be guaranteed to work if and only if the C++ part of the example is compiled with a compiler that implements memory_order_consume as memory_order_acquire.

Therefore, if Rust were to make a language-level guarantee that the example works, it would be tantamount to assuming that no compiler (that rustc cares about interoperating with) will ever implement memory_order_consume properly.

Which might actually be a valid assumption! But it also feels dangerous.

Arguably this is something that platform ABIs ought to specify one way or the other. After all, you could imagine a scenario similar to your example, but instead of Rust calling into C++, it's 'C++ compiled by compiler A' calling into 'C++ compiled by compiler B'. In that case, the example would be defined behavior, but for it to work, the two C++ compilers would have to agree on how memory_order_consume should be implemented.

On the other hand, you could argue that, from a C and C++ perspective, atomics are a standard library feature, and 'how to use standard library functionality' isn't usually part of the ABI specifications…

In any case, I haven't seen any ABI specifications go into detail about this. Of the ones I checked, the x86-64 and ARM ABI specifications don't explicitly define a mapping from the C++ atomics model to the instruction set at all, even though there are multiple possible mappings. I guess the standard mapping is just an informal convention. In contrast, the RISC-V ABI specification does define a mapping, but that document conspicuously omits any mention of consume.

@chorman0773
Contributor Author

chorman0773 commented Aug 18, 2024

It may be reasonable for a compiler to handle consume optimizations like acquire, but lower to hardware instructions that are only correct for the former (ARM's default loads/stores come to mind). This is something I've considered for lccc, yes (though xlang does omit an atomic consume access-class at the current time). This is more a theoretical question as to whether or not Rust fully inherits the C++ MT memory model, including parts we don't necessarily expose through the language.

A more targeted question is whether replacing the C++ code with assembly code on, e.g. ARM, that just does a default load, is correct. Such a load can't be considered to be an acquire load, but it's UB if the load in the Rust AM is only a Relaxed load.

As a note about platform ABI, I am currently adding atomics code to the psABI for my own architecture, and while it doesn't mention consume by name, it only has - and needs - special casing for seq_cst, as all lesser orders are just achieved by virtue of the least strict instruction sequence possible for each operation.

@RalfJung
Member

RalfJung commented Aug 18, 2024

As far as I know, all of the major compilers found it too hard to preserve dependencies through the optimization process,

I would phrase this as, nobody has even given a proper definition of what it would mean to preserve these dependencies. I consider memory_order_consume to have basically undefined semantics in C++ -- not in the sense of "UB", but in the sense of "these words you are using here do not actually have any precise meaning".

So in that sense I don't think any code can use memory_order_consume, no matter the language it is written in, making the question rather moot.

If you believe that there is a well-defined meaning to a C++ program with memory_order_consume (I think there isn't), then since Rust code also follows the C++ memory model, the combined program behaves as if it was one big program written in the C++ memory model. So FFI doesn't really change anything here, I think.

A more targeted question is whether replacing the C++ code with assembly code on, e.g. ARM, that just does a default load, is correct. Such a load can't be considered to be an acquire load, but it's UB if the load in the Rust AM is only a Relaxed load.

If you just have the "consume" load in the assembly, then I don't think that's any better than pretending that memory_order_consume has well-defined semantics.

If you have a larger assembly block that has both the "consume" load and the other operation that depends on it, and you can argue in the ARM memory model that because of the dependency of the 2nd operation on the first this is properly synchronized, that should be fine.

[...] would the consume behaviour (dependency-ordered before) follow into rust code that executes after?

I think the answer to that is "no" even with inline assembly, for the reasons spelled out above: there is no well-defined notion of "dependency-ordered before" in Rust (or C++), so it can't flow into the Rust code after the inline assembly either. This relation can only exist inside a single inline asm block.

@chorman0773
Contributor Author

I'd argue that dependency-ordered before is itself well-defined (in that it has semantics that can be determined), but compiler vendors have decided (rightly so, in my opinion) that properly implementing it is a right pain in the ass, so universally elect to implement it as the stronger memory_order_acquire. https://eel.is/c++draft/intro.races#7 and https://eel.is/c++draft/intro.races#8 (which together define the behaviour of a consume operation) are both fairly clear, in my opinion, as is https://eel.is/c++draft/intro.races#9 (which uses dependency-ordered before).

If you believe that there is a well-defined meaning to a C++ program with memory_order_consume (I think there isn't), then since Rust code also follows the C++ memory model, the combined program behaves as if it was one big program written in the C++ memory model. So FFI doesn't really change anything here, I think.

Given this, would it be fair to say that, assuming my position about the definition of the operation itself, it would be well-defined then?

@RalfJung
Member

RalfJung commented Aug 18, 2024

Given this, would it be fair to say that, assuming my position about the definition of the operation itself, it would be well-defined then?

As I said, I don't think FFI changes anything. But I can't help you figure out how memory_order_consume interacts with anything since I don't think it is well-defined to begin with. (Or, well, there is of course a simple way to define a syntactic notion of "dependency" on the source code. It's just entirely meaningless and hence may as well not exist. I'm not very interested in spending my time discussing models that clearly do not model reality. "All models are wrong, but some are useful" -- I think this one is not useful.)

@chorman0773
Contributor Author

chorman0773 commented Aug 18, 2024

As to how it interacts with compilers existing, I don't think there's an issue there.

I can imagine two scenarios here:

  1. The Rust code and the C++ code are not subject to cross-language LTO. Thus (assuming, without loss of generality, the Rust code was compiled using rustc with the LLVM backend) the call would be opaque, and subject to the rules of the target ISA + the correspondence from the Rust AM. Less formally, LLVM must assume that get_important_data can do any legal operation, which includes doing an acquire operation (rather than a consume operation) on IMPORTANT_DATA, or producing a freshly malloced (or better, __rust_alloced) pointer that has nothing to do with the value stored to IMPORTANT_DATA (and it's acquired by another thread, or otherwise).
  2. The code is subject to cross-lang LTO, which (under the same assumption) means that get_important_data() was compiled using clang or another C++ compiler that generates LLVM IR. Assuming that C++ compiler is compliant, that means that either LLVM correctly implements its equivalent to memory_order_consume or the C++ compiler lowers the consume operation to an LLVM acquire operation. Thus there's no correspondence problem here.

So there would be no correct compilation in either case that doesn't obey the AM-AM correspondence here - namely, that it would define that the Box::new in thread1 is dependency-ordered before (and thus happens-before) the deref in thread2.

(To be clear, the same reasoning would apply to any other tuple of (C++ frontend, C++ backend, Rust frontend, Rust backend) on the same target)

@comex

comex commented Aug 18, 2024

Or, well, there is of course a simple way to define a syntactic notion of "dependency" on the source code. It's just entirely meaningless and hence may as well not exist. I'm not very interested in spending my time discussing models that clearly do not model reality.

To be clear, that is exactly how the C++ spec defines "dependency". So it's well-defined, just difficult to implement.

Less formally, LLVM must assume that get_important_data can do any legal operation, which includes doing an acquire operation (rather than a consume operation) on IMPORTANT_DATA, or producing a freshly malloced (or better, __rust_alloced) pointer that has nothing to do with the value stored to IMPORTANT_DATA (and it's acquired by another thread, or otherwise).

But that's not the issue. Let me be more explicit. You may understand all of this already, but for clarity if nothing else…

In your example, as compiled by a typical optimizing compiler, the dependency ordering would typically work in practice. What follows is a modified example where it would not. The key difference is that instead of loading a pointer and then loading from that pointer, we load an integer and involve it in pointer arithmetic, but in a way that's trivially optimized out:

let ptr = addr_of!(OTHER_DATA).wrapping_add(important_data).wrapping_sub(important_data);

Full example:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::ptr::addr_of;

#[no_mangle]
pub static IMPORTANT_DATA: AtomicUsize = AtomicUsize::new(0);
static mut OTHER_DATA: i32 = 0;

#[no_mangle]
unsafe fn thread1() {
    OTHER_DATA = 10;
    IMPORTANT_DATA.store(20, Ordering::Release);
}

extern "C" {
    fn get_important_data() -> usize;
}

#[no_mangle]
unsafe fn thread2() {
    let important_data = get_important_data();
    assert_eq!(important_data, 20);
    let ptr = addr_of!(OTHER_DATA)
        .wrapping_add(important_data)
        .wrapping_sub(important_data);
    let val = *ptr;
    println!("{val}");
}
extern "C" {
    extern std::atomic<size_t> IMPORTANT_DATA;

    size_t get_important_data() {
        return IMPORTANT_DATA.load(std::memory_order::consume);
    }
}

For this to be safe, the load from IMPORTANT_DATA must not be reordered after the load from OTHER_DATA, by either the compiler or the hardware.

We can easily be assured that the compiler won't reorder them, because:

  • Without LTO inlining, the only way the compiler could reorder them is by moving the entire function call downward, which is illegal for several reasons.
  • With LTO inlining, the consume load will act as a compiler barrier.

However, to prevent the hardware from reordering the loads, on a typical non-x86 architecture, one of the following needs to be true:

  • A memory barrier instruction is sequenced between the two load instructions¹; or
  • The address of the second load has a value dependency, through a chain of assembly instructions, on the result of the first load.

At a source level (syntactically), the address of the second load does have a value dependency on the result of the first load. But a typical optimizer will remove .wrapping_add(offset).wrapping_sub(offset), and then there is no more assembly-level dependency. That means that there needs to be a barrier instruction.

The way Clang solves this is by having get_important_data emit a barrier after its load. This is the intended implementation strategy for acquire ordering, and in practice it's also the implementation strategy for consume ordering.

[Edit: The following is wrong; see my later comment.]

With consume ordering implemented as originally designed, however, the barrier would probably be located not in the callee get_important_data, but in the caller thread2. I haven't found an authoritative source fully describing the intended implementation strategy, but the example in [dcl.attr.depend] gives some clues. It seems like the compiler of a function caller would be required to track whenever there might be a source-level dependency chain from the call's return value to any subsequent load. If so, then the compiler would need to either emit a memory barrier after the call, or preserve the dependency chain into assembly. However, the latter is only an option if the function is marked with [[carries_dependency]]. If [[carries_dependency]] is not present, then a barrier is the only option.

To be frank, that seems completely insane. Given that dependency chains can be maintained through memory, I don't see how a compiler could avoid having to emit a barrier before and after almost every external function call. Maybe I'm missing something. Or maybe that's why nobody was willing to implement this scheme.

But the details don't matter for our purposes. What matters is that rustc's LLVM output does not follow this scheme. It never preserves dependency chains. It never emits barriers before or after external function calls. So if a C++ compiler implements consume in a way that puts any such requirements on function callers, and rustc output makes FFI calls into that compiler's output, then dependency chains are just not going to be preserved properly.

(In theory rustc could add a compatibility mode for such a compiler that did emit barriers around every external function call, but that seems horrible.)

Therefore, if Rust made a language-level guarantee that dependency chains are preserved, it would be tantamount to assuming that rustc will never need to interoperate with such a C++ compiler.

Footnotes

  1. Alternately, the first load can use a dedicated load-acquire instruction if supported, as on AArch64.

@chorman0773
Contributor Author

Fair, though here specifically the function doesn't have [[carries_dependency]], and the attribute comes with the standard caveat that if it's present on a declaration/definition in one TU and absent from a declaration in another, the program is IFNDR (ill-formed, no diagnostic required).

@comex

comex commented Aug 19, 2024

Well, that is the case where I was suggesting the caller would need a barrier.

But now I realize I was wrong. Sigh.

I have just reread one of the original proposals for consume ordering, and I realize that I misinterpreted the previously-linked example from the spec. It's still not spelled out very explicitly, but I am pretty sure the actual intended strategy is the following:

The compiler tracks which expressions 'carry dependencies' (loosely defined as "might be based on a memory_order_consume load"). An expression carries dependencies if:

  • It is a memory_order_consume load;
  • It is an incoming function argument and the argument is marked [[carries_dependency]];
  • It is a function call expression and the target function is marked [[carries_dependency]]; or
  • It is computed from another expression that carries dependencies.

If an expression that carries dependencies is:

  • passed as a non-[[carries_dependency]] argument;
  • returned from a non-`[[carries_dependency]]` function; or
  • optimized in a way that breaks the dependency chain,

then the compiler emits a memory barrier at that point.

This is much less insane.

And under this strategy, as long as Rust does not support FFI into or out of extern functions whose corresponding C++ declarations include [[carries_dependency]], it becomes impossible to get a dependency-carrying expression into Rust code, so rustc never has to do anything special with optimizations or barriers.

In particular, in the get_important_data examples (both the original and my version), it is the callee, get_important_data, that would include a barrier. It's just that the barrier would be placed not after the consume load, but at the function return (which happens to be equivalent in this case).

So I changed my mind. Making guarantees about your example is probably fine. Still, it's hard to be sure without seeing any concrete implementations that don't just treat consume as acquire.
