Implement atomics wait/notify with C++20 runtime support #2268

shravanrn · 2023-07-02T12:50:04Z

Implementing atomic wait/notify using runtime support with some C++20. This PR contains the following changes.

Implement opcodes in wasm2c for wait/notify.
Implement atomic wait/notify with C++20 wait/notify primitives. Since C++ wait/notify doesn't support timeouts, we needed some extra code to implement them. We have implemented this as a separate timer thread.
Continue building wabt with C++17 and dummy implementations of wait/notify that simply abort when called.
Add a cmake option, BUILD_WASM2C_THREAD_WAIT_NOTIFY, which builds wabt with C++20 and includes the C++20 wait/notify runtime support.
Enable all spec testsuite atomic tests.
Add more tests for atomic wait/notify which are not covered by the spec testsuite, as they require wasm modules running simultaneously.
Modify the spec test scripts to support simultaneously running wasm modules (needed by the previous point) and to compile the runtime wait/notify support with C++20 compilers.

Note -- the build defaults to including the dummy wait/notify implementation so that consumers don't need C++20 yet. The test suite defaults to using C++20 so we can continue to test the wait/notify runtime support.

shravanrn · 2023-07-03T13:47:55Z

Edit: all tests passing now

Looks like I have to fix some asan build settings. Will fix. Please feel free to start code reviews while I fix this though --- these fixes should only require minor changes to the build scripts.

shravanrn · 2023-07-12T01:18:13Z

Ping: reminder to have a look when you have a chance :)

sbc100

Looks like a very reasonable approach!

lgtm, although didn't have time to digest the changes to the test runner yet. Everything else lgtm % comments

wasm2c/wasm-rt-threads-impl.cpp

wasm2c/wasm-rt-threads.h

shravanrn · 2023-07-12T22:47:05Z

I have fixed the comments.

I've also tested that this works with the wasi-thread implementation
https://github.com/shravanrn/wasm2c_pthread_test

shravanrn · 2023-07-30T13:53:37Z

@sbc100 @keithw any chance you could take a look at this pr?

keithw · 2023-07-30T21:52:42Z

I guess I have some higher-level questions... feel free to tell me if this is overkill. I should admit I'm basically terrified of threads and the complexities they introduce.

libstdc++ seems to show how atomic wait and notify can be implemented without dynamic data structures, without a limit on the number of waiters, and without spawning a separate thread. (https://developers.redhat.com/articles/2022/12/06/implementing-c20-atomic-waiting-libstdc) Is this approach just overkill for our purposes, or not feasible or not worth it, or...?

FWIW, somebody seems to have produced a plausible-looking wait-with-timeout on top of std::atomic and std::counting_semaphore<>::try_acquire_until() (https://stackoverflow.com/questions/69660148/c20-how-to-wait-on-an-atomic-object-with-timeout).
Do you think we should do something like the suggestion in "wasm2c: we need to pick standard C data structure implementations to finish thread proposal" #2258 (comment) to be Wasm-correct in terms of the treatment of non-atomic operations in shared memory? (E.g. make non-atomic shared-memory loads and stores use C11 relaxed atomics.) I don't love the idea of introducing UB into the wasm2c output.
I think you're totally right that the threads proposal would benefit from some tests with multiple threads executing in parallel, but I don't relish us creating and taking ownership of all that ourselves (the syntax to define the parallel tests, the actual parallel tests themselves, executing the syntax in the test runner, etc.). I assume we'll eventually want more parallel tests than just "instantiate each module in parallel and run their respective commands 2 seconds apart." Do you have any interest in working with upstream on whatever their plan is for parallel tests, and we can implement whatever you/they pick along with everybody else?
It would be nice to document the new functions in the runtime API (i.e. what's expected from the runtime).

shravanrn · 2023-08-01T18:27:45Z

libstdc++ seems to show ... Is this approach just overkill for our purposes, or not feasible or not worth it, or...?

A mix of the above. I did look at that link in detail prior to the implementation. The first parts, on things like "how many spinlock iterations" to run before calling sleep, are things that should happen at the Wasm libc level, and so would be transparent to us. However, if the value change doesn't happen while the Wasm libc is spinning, then the Wasm libc has to fall back to the Wasm platform wait/notify primitives.

Re the bits about avoiding data structures: unfortunately this is not applicable to Wasm. libstdc++ can do this because they have to "implement a wait/notify until an underlying value has changed, AND they allow spurious wakeups". OS APIs like futex are well set up for this. Wasm on the other hand wants "a wait/notify that waits UNTIL NOTIFIED AND with no spurious wakeups". This is a large difference that causes issues. I'll walk through this below by contrasting libstdc++'s wait/notify with Wasm's wait/notify.

Consider an example where we want a Wasm implementation that avoids data structures. We want to use platform primitives like futex (or in our case, the host C++ runtime's atomic wait/notify support) to make this happen. However, platform primitives have spurious wakeups. When a spurious wakeup happens, we need a test to see whether the wakeup is spurious. In libstdc++'s wait/notify, the test is simple --- check if the value has changed. Wasm does not have an easy test, as it is defined as "wake up only when you get a notify" --- there is no guarantee that a change in the underlying value is the thing the program was waiting for. The only way to have a valid test is to set up a separate data structure that bridges these differences. (This was the crux of my earlier comment about why I think this design was a mistake in the Wasm standards)
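To make the "separate data structure" point concrete, here is a minimal sketch, not the code from this PR. It assumes a per-address queue of waiters, each carrying an explicit `notified` flag, with `std::condition_variable` standing in for the platform primitive. The names `Waiter`, `WaitTable` and `demo_wait_then_notify` are hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <iterator>
#include <list>
#include <map>
#include <mutex>
#include <thread>

// All names here are hypothetical; this is not the code from the PR.
struct Waiter {
  std::condition_variable cv;
  bool notified = false;  // set only by notify(), never by a value change
};

class WaitTable {
 public:
  // Result codes follow memory.atomic.wait32:
  // 0 = woken by notify, 1 = value mismatch, 2 = timed out.
  int wait(const std::atomic<uint32_t>& cell, uint32_t expected,
           std::chrono::nanoseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    if (cell.load() != expected) return 1;
    Waiter self;
    auto& queue = waiters_[&cell];
    queue.push_back(&self);
    auto pos = std::prev(queue.end());
    // The predicate filters spurious wakeups: only an explicit notify counts.
    bool woken = self.cv.wait_for(lock, timeout, [&] { return self.notified; });
    if (!woken) queue.erase(pos);  // timed out, so we are still queued
    return woken ? 0 : 2;
  }

  // Wakes up to `count` waiters in FIFO order; returns how many were woken.
  uint32_t notify(const std::atomic<uint32_t>& cell, uint32_t count) {
    std::lock_guard<std::mutex> lock(mu_);
    auto& queue = waiters_[&cell];
    uint32_t woken = 0;
    while (woken < count && !queue.empty()) {
      Waiter* w = queue.front();
      queue.pop_front();
      w->notified = true;
      w->cv.notify_one();
      ++woken;
    }
    return woken;
  }

 private:
  std::mutex mu_;
  std::map<const std::atomic<uint32_t>*, std::list<Waiter*>> waiters_;
};

// Usage sketch: a waiter blocks until another thread notifies it.
int demo_wait_then_notify() {
  WaitTable table;
  std::atomic<uint32_t> cell{7};
  int result = -1;
  std::thread waiter(
      [&] { result = table.wait(cell, 7, std::chrono::seconds(5)); });
  while (table.notify(cell, 1) == 0) std::this_thread::yield();
  waiter.join();
  return result;
}
```

The value of `cell` never changes in the usage sketch, yet the waiter still wakes. That is exactly the case a "has the value changed?" test cannot distinguish from a spurious wakeup.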

Also, another interesting data point here is that when the platform primitives do not match exactly what they want, even libstdc++ has to resort to extra data structures. This is discussed under "How to handle those types that do not fit in a __platform_wait_t". In their case, the limitations of the platform APIs have to do with the size of atomics (which is different from what we consider as limitations), but it nevertheless demonstrates why separate data structures are needed when the platform primitives do not exactly match what the spec demands.

FWIW, somebody seems to have produced a plausible-looking wait-with-timeout on top of std::atomic and std::counting_semaphore<>::try_acquire_until() (https://stackoverflow.com/questions/69660148/c20-how-to-wait-on-an-atomic-object-with-timeout).

Oh this is neat. I wasn't aware of try_acquire_until. I checked how this is implemented in libstdc++ https://developers.redhat.com/articles/2023/04/18/implementing-c20-semaphores#semaphores_in_c__ and it seems like they are relying on atomics where possible, which means this won't just spin and wreck system performance. I think something like this could allow me to kill the timer thread in this implementation. It won't eliminate the extra data structures though. I'll investigate this more and update this thread/PR appropriately.

Do you think we should do something like #2258 (comment) to be Wasm-correct in terms of the treatment of non-atomic operations in shared memory? (E.g. make non-atomic shared-memory loads and stores use C11 relaxed atomics.) I don't love the idea of introducing UB into the wasm2c output.

We may need to do something like that to be in compliance with the Wasm spec. The upside is that this is easily possible with the relaxed parameter to the existing compiler primitive atomics we rely on https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
I want to investigate further and talk to @conrad-watt before specifying the exact path forward for wasm2c here, but I can concretely say that I believe there is a way to implement this in wasm2c without UB, at comparable/equal performance to other Wasm engines, and without slowing down the single threaded wasm2c.
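For reference, the relaxed form of the gcc/clang builtins mentioned above looks roughly like this. This is a minimal sketch: `load_relaxed` and `store_relaxed` are hypothetical helper names rather than wasm2c's actual load/store macros, and it assumes naturally aligned 32-bit accesses on a gcc- or clang-compatible compiler.

```cpp
#include <cstdint>

// Hypothetical helpers; wasm2c's real memory-access macros differ.
// Assumes `p` is naturally aligned and the compiler provides the
// gcc/clang __atomic builtins (MSVC does not).
inline uint32_t load_relaxed(const uint32_t* p) {
  return __atomic_load_n(p, __ATOMIC_RELAXED);
}

inline void store_relaxed(uint32_t* p, uint32_t value) {
  __atomic_store_n(p, value, __ATOMIC_RELAXED);
}
```

A relaxed access gives no ordering guarantees relative to other memory operations, but it is no longer a data race in the C/C++ sense when another thread touches the same location concurrently.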

However, I would like to separate this bit into a new PR, as the current PR is mostly focused on the runtime. I expect at least 2 more PRs in this space before we get to full spec compliance --- one PR to add some spec required checks on the shared memory bit (shared memory has to have a max size etc.), and one PR to address this point of regular non-atomic load/stores to shared memory.

but I don't relish us creating and taking ownership of all that ourselves (the syntax to define the parallel tests, the actual parallel tests themselves, executing the syntax in the test runner, etc.).

On the fundamental philosophy of whether we want to maintain tests --- at a high level, I do want to caution that I don't think we want to be in a situation where we say the only tests we will execute are the upstream tests. I think this is unrealistic for any production-level wasm engine. WAMR for instance certainly has additional thread-wait-notify tests in their repo https://github.com/bytecodealliance/wasm-micro-runtime/tree/main/core/iwasm/libraries/lib-wasi-threads/test as does Wasmtime https://github.com/bytecodealliance/wasmtime/blob/72b87183ffc57d834737e9ad6d4f1c1967e559a9/tests/all/wait_notify.rs

On the point about taking ownership of the syntax to define the parallel tests and executing that syntax in the test runner --- while I would definitely agree if this test runner change were sufficiently complicated, it turns out that it's, simply put, not that hard. I haven't counted the line diff for just this part, but this PR basically shows that it is probably less than 50 lines of code, with changes that are pretty easy to grok or build on. Summarizing, the basic change is "invoke each test-function sequentially" --> "invoke each test-function using THREAD_CREATE", then adding a THREAD_JOIN at the end, and then adding one annotation to the wasm2c custom test format under the test/*.txt path (apart from some minor things like making the test counts atomic).
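The shape of that runner change can be sketched as follows. This is an illustration only: `std::thread` stands in for the THREAD_CREATE/THREAD_JOIN macros, and `run_tests_in_parallel` is a hypothetical name, not a function from the test harness.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical runner: starts every test function on its own thread,
// joins them all, and returns how many reported success.
int run_tests_in_parallel(const std::vector<std::function<bool()>>& tests) {
  std::atomic<int> passed{0};  // atomic because threads update it concurrently
  std::vector<std::thread> threads;
  threads.reserve(tests.size());
  for (const auto& test : tests) {
    // Was: call test() directly, one after another.
    threads.emplace_back([&test, &passed] {
      if (test()) passed.fetch_add(1, std::memory_order_relaxed);
    });
  }
  for (auto& t : threads) t.join();  // the "THREAD_JOIN at the end" step
  return passed.load();
}
```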

Ultimately, I don't think there is a lot here. But if you really feel strongly about this part, I am ok with ripping out the extra tests.

I assume we'll eventually want more parallel tests than just "instantiate each module in parallel and run their respective commands 2 seconds apart."

I don't have any tests in mind that need more than what this PR brings.
If we do want to expand this in the future, we can definitely consider if each such expansion is worth it, relative to the benefits of the test we will additionally run.

Do you have any interest in working with upstream on whatever their plan is for parallel tests, and we can implement whatever you/they pick along with everybody else?

Unfortunately, I do not have cycles for this at this time. Additionally, with the possible changes in the underlying thread proposal being discussed, I am wary of entering this space of the standards discussions, when the momentum and folks' attention is plainly in other parts of the thread proposal (resolving thread_create). Maybe at some point in the future. Ultimately, though, I think this is an orthogonal question --- we can still land these tests in this PR and separately pursue getting these tests included upstream in parallel or after.

shravanrn · 2023-08-02T17:55:05Z

Update: I had a chat with @conrad-watt and it looks like I can add to the above.

Apparently concurrent spec tests are much further along than I expected (they already exist, albeit on a non-main branch of the spec test repo). They are expected to make it to the main branch by mid October. Given this, and @keithw's concerns above, I am happy to rip out the concurrent testing flag/syntax we have for now. We will likely need something very similar in the future (but with some changes) to accommodate the upstream syntax for the concurrent spec tests. @conrad-watt also mentioned I could share any missing tests directly with him given that this is getting updated as we speak.

On the bit about non-atomic accesses, I have confirmed that we can go ahead and implement it as relaxed memory order and this is implementable in a way that won't introduce UB or performance issues in single or multi threaded wasm2c output for gcc/clang. I am still working on figuring out the path for msvc --- on msvc we can achieve spec compatibility, but maybe not optimal performance as I haven't yet figured out how to get memory_order_relaxed via compiler intrinsics, meaning we may have to employ a stronger memory order sacrificing performance. I will investigate further to see how to avoid this.

Finally, one idea prompted by discussion with @conrad-watt does perhaps reopen the possibility of relying on c11 atomic accesses instead of compiler intrinsics which would be much nicer. I will investigate if this path can be taken while avoiding UB in wasm2c, in parallel to these changes.

Proposed path forward

refactor this pr to (1) remove test runner changes and (2) (if possible) implement the timer approach linked by @keithw above as it looks cleaner
pr to have memory_order_relaxed non-atomic accesses to shared memory
pr to add some shared memory related checks for spec compliance (Atomic accesses only to shared memory etc.)
(if possible without UB) pr to replace the current "atomic accesses via compiler intrinsics" with "atomic accesses via c11 atomics"

keithw · 2023-08-04T04:27:24Z

That plan sounds pretty good to me!

shravanrn · 2023-10-02T01:57:26Z

Will create new PRs with proposed changes

shravanrn requested review from keithw and sbc100 July 2, 2023 12:50

shravanrn force-pushed the atomics_part2 branch from ddde279 to d697db4 Compare July 2, 2023 12:56

shravanrn force-pushed the atomics_part2 branch from d697db4 to 61cfc4c Compare July 3, 2023 14:31

sbc100 reviewed Jul 12, 2023


shravanrn force-pushed the atomics_part2 branch from 61cfc4c to 7e2ae21 Compare July 12, 2023 22:46

Implement atomics wait/notify with C++20 runtime support

165ee1e

shravanrn force-pushed the atomics_part2 branch from 7e2ae21 to 165ee1e Compare July 30, 2023 19:27

shravanrn closed this Oct 2, 2023

shravanrn mentioned this pull request Oct 2, 2023

wasm2c: atomic and shared mem operations using c11 #2308

Merged

shravanrn mentioned this pull request May 15, 2024

wasm2c: partial support for atomic memory ops #2233

Merged
