Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should something for 128-bit division or wide division be included? #15

Open
alexcrichton opened this issue Sep 5, 2024 · 4 comments
Open

Comments

@alexcrichton
Copy link
Collaborator

Currently this proposal's overview states that division on native platforms always goes to __udivti3 so it's probably not worth adding anything. Additionally the overview cites that while x64 has the ability to divide a 128-bit number by a 64-bit number that's not what 128-bit division in wasm wants in theory.

Despite these two claims benchmarking shows that bignum division is significantly slower in WebAssembly than it is in native. Later benchmarking has shown that with a prototype version of this proposal the gap can be significantly reduced, but it's still quite far behind native.

I've done some further investigation and found that the Rust compiler's implementation of __udivti3 wasn't optimal for wasm and native architectures were using a different algorithm. After updating the algorithm used and enabling a prototype for this proposal on x64 the performance gap is that wasm is now 36% behind native. This is significantly closer than previous attempts (yay!)

Given all this I'd like to open this issue to track further investigation of this. For example are there remaining low-hanging fruit in this 36% that can be easily picked by slightly changing this proposal or adding a new opcode? I'm not sure at this time!

@AaronKutch
Copy link

I wrote the algorithms used in compiler-builtins. The delegate algorithm is used on many targets as the default for 128 bit division. The trifecta algorithm is the fastest algorithm on architectures that have fast 64 to 128 bit widening multiplication and a hardware 64-by-64 divide (it was originally configured by just checking the pointer width, but I forgot about webassembly and Alex just changed it to also be enabled on webassembly which is probably the best choice for now). The asymmetric algorithm configured on x86-64 takes advantage of a special assembly instruction that has a 128-by-64 divide, and expands it to the full 128-by-128. So, the remaining performance gap is because asymmetric is faster on x86-64 but requires assembly. The reference crate for the division code used in compiler-builtins is https://crates.io/crates/specialized-div-rem .

@alexcrichton
Copy link
Collaborator Author

Thanks for the info and links @AaronKutch!

One thing I can clarify is that in my testing I was not testing __udivti3 with this proposal enabled, so the trifect algorithm in compiler-builtins didn't have access to the widening 64-to-128 multiplication instructions. That probably explains the remaining 36% performance gap, and I'll try to confirm this with more testing. (or I did test this and I'm misremembering...)

My hypothesis is that if compiler-builtins is built with this proposal (which has widening multiplication), and I test with the updated version that's using the trifecta algorithm, then I should see a small performance loss on x86_64 due to not having access to 128-by-64 divides but close to no performance loss on aarch64 which does not have a native 128-by-64 divide.

I'll work on measuring this hypothesis.

@alexcrichton
Copy link
Collaborator Author

Upon re-measuring I'm seeing the following numbers. This is with an updated division algorithm in Rust's compiler-builtins:

  • Baseline wasm - 223% slower-than-native
  • This proposal enabled for the benchmark, but not compiler-builtins - 36% slower-than-native
  • This proposal enabled for both the benchmark and compiler-builtins - 39% slower-than-native

So it looks like this benchmark in particular is not stressing these paths of __udivti3. The slowdown of 36-to-39 when optimizing compiler-builtins as well might just be noise. I confirmed that __udivti3 is using instructions from this proposal, however.

@alexcrichton
Copy link
Collaborator Author

Following up on this with arm64 numbers, I've measured now that with everything updated an M2 is 15% slower-than-native in this division benchmark. That follows the intuition here I think where 15% is probably "built in wasmtime overhead" and the 20% difference with x64 is likely that x64 has access to a native instruction wasm can't use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants