Implement Fast CIOS for Montgomery modular multiplication #1009

clementjuventin · 2024-09-17T08:28:34Z

This PR implements the improvement of the CIOS variant of the Montgomery algorithm showcased in the following document: EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication and proposed in #869.
The optimization occurs when most_significant_p_word < (word_size / 2), p is represented by the variable mod in the current implementation.

Here are the benchmarks performed after applying these changes starting from commit 01eca77.

Build: cmake --build build --parallel

Benchmarks: taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json

Comparison:

user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json 
Comparing cios_classic.json to cios_improved.json
Benchmark                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
------------------------------------------------------------------------------------------------------------------------------------
evmmax_mul<uint256, bn254>_pvalue                     0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean                      +7.8884         +7.8873            27           238            27           238
evmmax_mul<uint256, bn254>_median                    +7.8595         +7.8584            27           237            27           237
evmmax_mul<uint256, bn254>_stddev                   +17.5405        +17.5057             0             3             0             3
evmmax_mul<uint256, bn254>_cv                        +1.0859         +1.0823             0             0             0             0
evmmax_mul<uint256, secp256k1>_pvalue                 0.0002          0.0002      U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean                  +7.6644         +7.6636            29           249            29           249
evmmax_mul<uint256, secp256k1>_median                +7.7730         +7.7721            28           249            28           249
evmmax_mul<uint256, secp256k1>_stddev               +10.6559        +10.6522             1             7             1             7
evmmax_mul<uint256, secp256k1>_cv                    +0.3453         +0.3450             0             0             0             0
OVERALL_GEOMEAN                                      +1.0661         +1.0658             0             0             0             0

As you can see, the results do not indicate a performance improvement.

clementjuventin · 2024-09-19T15:42:20Z

After further investigation, I found another way of ordering branches (second commit) and obtained what we were looking for (~15% gain on evmmax_mul<uint256, bn254>)!

evmmax_mul<uint256, bn254>_median                    -0.1529         -0.1529            27            23            27            23
evmmax_mul<uint256, secp256k1>_median                +0.0059         +0.0058            28            28            28            28

I still don't get why this implementation is better, even assuming branch prediction makes the checks insignificant.
I would also like to compare the assembly code in case there is important optimization under the hood but I never did so let's see if I manage to do it

clementjuventin force-pushed the fast-montgomery-cios branch from b693f02 to 79ad43b Compare September 19, 2024 15:33

improve CIOS implementation

1d3aab1

clementjuventin force-pushed the fast-montgomery-cios branch from 79ad43b to 1d3aab1 Compare September 19, 2024 15:58

improve CIOS implementation - second approach

2c9d6f6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Fast CIOS for Montgomery modular multiplication #1009

Implement Fast CIOS for Montgomery modular multiplication #1009

clementjuventin commented Sep 17, 2024 •

edited

Loading

clementjuventin commented Sep 19, 2024 •

edited

Loading

Implement Fast CIOS for Montgomery modular multiplication #1009

Are you sure you want to change the base?

Implement Fast CIOS for Montgomery modular multiplication #1009

Conversation

clementjuventin commented Sep 17, 2024 • edited Loading

clementjuventin commented Sep 19, 2024 • edited Loading

clementjuventin commented Sep 17, 2024 •

edited

Loading

clementjuventin commented Sep 19, 2024 •

edited

Loading