Improve `imulc` Performance #21

nlordell · 2022-09-18T22:19:00Z

Checked signed integer multiplication (imulc and I256::checked_mul) use division to verify overflows. This is, I think, quite inefficient and unnecessary.

This issue captures the work to:

Benched checked signed multiplication
Investigate and implement possible alternative algorithms

One idea would be to multiply absolutes and then verify the sign afterwards. I'm not sure if this will work for I256::overflowing_mul (in that the wrapping produces different results), but it might be worth a try!

The text was updated successfully, but these errors were encountered:

NCGThompson · 2023-12-02T14:09:22Z

As expected, imulc has abysmal performance. On Kaby Lake, I256::overflowing_mul benches 20 to 27 times slower than wrapping_mul. According to uops.info, on Ice Lake up, 128-bit / 64-bit DIV runs with 1/2 to 1/3 the rthroughput, so it should be better. I can't find similar data for non-intel chips. Interestingly, with or without llvm-intrinsics, U256::overflowing_mul benches about twice the time of U256::wrapping_mul, so we should probably manage our expectations. I might take a look with mca to see where that extra time is coming from.

Since there is no llvm-intrinsics function for imulc, the slow native function is also ran with llvm-intrinsics. Consider adding an intrinsic. Also,

https://github.com/nlordell/ethnum-rs/blob/b66678b45c9f935930a95aac8b0e379b9b7d3dea/src/intrinsics/native/mul.rs#L87C2-L87C2

can trivially be replaced with

let r = abs_a.as_u256().overflowing_mul(abs_b.as_u256());
r.1 || r.0 > I256::MAX.as_u256()

The redundant multiplication is tiny next to a division.

NCGThompson · 2023-12-03T05:34:29Z

tl;dr: nlordell's idea of verifying the sign afterwards should work and his concern of wrapping problems is unfounded. There is a tiny caveat with I256::MIN.

The advantage of two's complement is that the values wrap around correctly, and are congruent modulo 2^bits to their binary representation reinterpreted as an unsigned integer. This means that we can do modulo arithmetic on congruence classes and apply it to signed and unsigned integers simultaneously.

imulc is given two integers, a̅₂₅₆ and b̅₂₅₆. One of our outputs, c̅₂₅₆, is the modular product of our two inputs. The bool output overflow, on the other hand, is partial to whether or not we are interpreting the values as signed or unsigned.

Let's define functions that interpret the congruence classes as regular integers. signed(x) interprets x as a signed integer while unsigned(x) interprets x as an unsigned integer. So for example signed(-1̅0₂₅₆) = -10 while unsigned(-1̅0₂₅₆) = 2²⁵⁶ - 10.

overflow should return positive iff signed(a̅₂₅₆) * signed(b̅₂₅₆) ≠ signed(c̅₂₅₆). Equivalently, overflow should return false iff signed(I256::MIN) ≤ signed(a̅₂₅₆) * signed(b̅₂₅₆) ≤ signed(I256::MAX).

One idea would be to multiply absolutes and then verify the sign afterwards.

That makes sense. If umulc(...).1 of the absolute values returns true, then we know overflow is true. I'm not going to bother to explain that with math symbols. If umulc didn't overflow and signed(product of absolutes) is non-negative, then we know overflow is false. If the product is negative, then overflow is true unless the product is I256::MIN and exactly one of our inputs is negative.

I'm not sure if this will work for I256::overflowing_mul (in that the wrapping produces different results)

We wan't to be able to "undo" the absolute value at the end by negating the product if exactly one of the inputs was negative. So the question is whether or not c̅₂₅₆ ≡ 0̅₂₅₆ - (0̅₂₅₆ - a̅₂₅₆) * b̅₂₅₆ and c̅₂₅₆ ≡ (0̅₂₅₆ - a̅₂₅₆) * (0̅₂₅₆ - b̅₂₅₆)

With modulo arithmetic, congruence is preserved with addition, subtraction, and multiplication. Therefor we can rewrite the question more simply: "Does a * b = -(-a * b) and a * b = -a * - b?" This is clearly true.

NCGThompson · 2024-01-01T03:29:50Z

This is how llvm does it. This function is called on Aarch64 for i128 mul. https://github.com/llvm/llvm-project/blob/main/compiler-rt/lib/builtins/int_mulo_impl.inc

nlordell added good first issue Good for newcomers and removed good first issue Good for newcomers labels Sep 18, 2022

NCGThompson mentioned this issue Dec 2, 2023

Added Optional Multiplication Benchmarks #39

Open

NCGThompson mentioned this issue Dec 3, 2023

Replaced some overflowing muls with wrapping muls #40

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve `imulc` Performance #21

Improve `imulc` Performance #21

nlordell commented Sep 18, 2022

NCGThompson commented Dec 2, 2023

NCGThompson commented Dec 3, 2023

NCGThompson commented Jan 1, 2024

Improve imulc Performance #21

Improve imulc Performance #21

Comments

nlordell commented Sep 18, 2022

NCGThompson commented Dec 2, 2023

NCGThompson commented Dec 3, 2023

NCGThompson commented Jan 1, 2024

Improve `imulc` Performance #21

Improve `imulc` Performance #21