Should something for 128-bit division or wide division be included? #15

alexcrichton · 2024-09-05T16:12:46Z

Currently this proposal's overview states that division on native platforms always goes to __udivti3 so it's probably not worth adding anything. Additionally the overview cites that while x64 has the ability to divide a 128-bit number by a 64-bit number that's not what 128-bit division in wasm wants in theory.

Despite these two claims benchmarking shows that bignum division is significantly slower in WebAssembly than it is in native. Later benchmarking has shown that with a prototype version of this proposal the gap can be significantly reduced, but it's still quite far behind native.

I've done some further investigation and found that the Rust compiler's implementation of __udivti3 wasn't optimal for wasm and native architectures were using a different algorithm. After updating the algorithm used and enabling a prototype for this proposal on x64 the performance gap is that wasm is now 36% behind native. This is significantly closer than previous attempts (yay!)

Given all this I'd like to open this issue to track further investigation of this. For example are there remaining low-hanging fruit in this 36% that can be easily picked by slightly changing this proposal or adding a new opcode? I'm not sure at this time!

The text was updated successfully, but these errors were encountered:

AaronKutch · 2024-09-12T19:27:51Z

I wrote the algorithms used in compiler-builtins. The delegate algorithm is used on many targets as the default for 128 bit division. The trifecta algorithm is the fastest algorithm on architectures that have fast 64 to 128 bit widening multiplication and a hardware 64-by-64 divide (it was originally configured by just checking the pointer width, but I forgot about webassembly and Alex just changed it to also be enabled on webassembly which is probably the best choice for now). The asymmetric algorithm configured on x86-64 takes advantage of a special assembly instruction that has a 128-by-64 divide, and expands it to the full 128-by-128. So, the remaining performance gap is because asymmetric is faster on x86-64 but requires assembly. The reference crate for the division code used in compiler-builtins is https://crates.io/crates/specialized-div-rem .

alexcrichton · 2024-09-12T20:13:59Z

Thanks for the info and links @AaronKutch!

One thing I can clarify is that in my testing I was not testing __udivti3 with this proposal enabled, so the trifect algorithm in compiler-builtins didn't have access to the widening 64-to-128 multiplication instructions. That probably explains the remaining 36% performance gap, and I'll try to confirm this with more testing. (or I did test this and I'm misremembering...)

My hypothesis is that if compiler-builtins is built with this proposal (which has widening multiplication), and I test with the updated version that's using the trifecta algorithm, then I should see a small performance loss on x86_64 due to not having access to 128-by-64 divides but close to no performance loss on aarch64 which does not have a native 128-by-64 divide.

I'll work on measuring this hypothesis.

alexcrichton · 2024-09-13T20:48:51Z

Upon re-measuring I'm seeing the following numbers. This is with an updated division algorithm in Rust's compiler-builtins:

Baseline wasm - 223% slower-than-native
This proposal enabled for the benchmark, but not compiler-builtins - 36% slower-than-native
This proposal enabled for both the benchmark and compiler-builtins - 39% slower-than-native

So it looks like this benchmark in particular is not stressing these paths of __udivti3. The slowdown of 36-to-39 when optimizing compiler-builtins as well might just be noise. I confirmed that __udivti3 is using instructions from this proposal, however.

alexcrichton · 2024-09-20T21:29:55Z

Following up on this with arm64 numbers, I've measured now that with everything updated an M2 is 15% slower-than-native in this division benchmark. That follows the intuition here I think where 15% is probably "built in wasmtime overhead" and the 20% difference with x64 is likely that x64 has access to a native instruction wasm can't use.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should something for 128-bit division or wide division be included? #15

Should something for 128-bit division or wide division be included? #15

alexcrichton commented Sep 5, 2024

AaronKutch commented Sep 12, 2024

alexcrichton commented Sep 12, 2024

alexcrichton commented Sep 13, 2024

alexcrichton commented Sep 20, 2024

Should something for 128-bit division or wide division be included? #15

Should something for 128-bit division or wide division be included? #15

Comments

alexcrichton commented Sep 5, 2024

AaronKutch commented Sep 12, 2024

alexcrichton commented Sep 12, 2024

alexcrichton commented Sep 13, 2024

alexcrichton commented Sep 20, 2024