Fixed-point matrix multiplication improvements #2062

fredrik-johansson · 2024-09-05T12:56:10Z

We define add_ss* / sub_dd* up to 8 limbs on all architectures. This is a bit iffy: the C fallbacks might produce poor assembly depending on the compiler, and the inline asm versions can exhaust the available registers. On x86-32 I had to switch to the C fallbacks for the longer macros. I think failures are also possible on x86-64 with 7-8 operands, depending on how the macros are used, though the uses in the current codebase don't seem to hit this limit. All of this trouble because compilers don't understand carry flags, sigh.
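To illustrate why the C fallbacks are awkward, here is a minimal sketch of a portable 3-limb addition with manual carry propagation. The names `limb_t`, `add_3limb` and `add_3limb_demo` are hypothetical and are not the macros from this PR; the sketch only shows the general technique of recovering each carry with a comparison, which the compiler may or may not turn back into an add-with-carry chain.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* assumption: 64-bit limbs */

/* Hypothetical helper (not a FLINT macro):
   (s2,s1,s0) = (a2,a1,a0) + (b2,b1,b0) mod 2^192,
   with limb 0 the least significant. */
static inline void add_3limb(limb_t *s2, limb_t *s1, limb_t *s0,
                             limb_t a2, limb_t a1, limb_t a0,
                             limb_t b2, limb_t b1, limb_t b0)
{
    limb_t t0, t1, c0, c1;

    t0 = a0 + b0;
    c0 = t0 < a0;        /* carry out of limb 0 */

    t1 = a1 + b1;
    c1 = t1 < a1;        /* carry from a1 + b1 */
    t1 += c0;
    c1 += t1 < c0;       /* carry from adding the incoming carry;
                            both carries cannot occur together */

    *s0 = t0;
    *s1 = t1;
    *s2 = a2 + b2 + c1;  /* top limb wraps modulo 2^64 */
}

/* Usage example: (2^128 - 1) + 1 = 2^128, so the carry must ripple
   through both low limbs into the top limb. */
void add_3limb_demo(void)
{
    limb_t s2, s1, s0;
    add_3limb(&s2, &s1, &s0, 0, UINT64_MAX, UINT64_MAX, 0, 0, 1);
    assert(s2 == 1 && s1 == 0 && s0 == 0);
}
```

Each extra limb adds two comparisons and two additions in this style, whereas an inline asm version needs one add-with-carry instruction per limb but must keep all operands in registers at once.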
Fixed-point matrix multiplication is optimized for medium-size matrices by using dot products with inlined code and by incorporating Strassen multiplication, along with new tuning values. The internal nfixed representation is also made semi-public.
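As a rough illustration of combining Strassen multiplication with a classical base case, here is a minimal sketch. It is not the `_nfixed_mat_mul_strassen` code from this PR: the names `mul_classical`, `blk_lin` and `mul_strassen` and the `cutoff` parameter are hypothetical, entries are single `uint64_t` mantissas with an implied common scale (arithmetic wraps modulo 2^64), and odd dimensions simply fall back to the classical algorithm.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* C = A * B for n x n blocks stored row-major with strides cs, as, bs.
   C must not alias A or B. */
void mul_classical(uint64_t *C, int cs, const uint64_t *A, int as,
                   const uint64_t *B, int bs, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            uint64_t s = 0;   /* inlined dot product of row i and column j */
            for (int k = 0; k < n; k++)
                s += A[i * as + k] * B[k * bs + j];
            C[i * cs + j] = s;
        }
}

/* R = X + Y (sub == 0) or R = X - Y (sub != 0); R is a dense h x h block. */
static void blk_lin(uint64_t *R, const uint64_t *X, int xs,
                    const uint64_t *Y, int ys, int h, int sub)
{
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            R[i * h + j] = sub ? X[i * xs + j] - Y[i * ys + j]
                               : X[i * xs + j] + Y[i * ys + j];
}

void mul_strassen(uint64_t *C, int cs, const uint64_t *A, int as,
                  const uint64_t *B, int bs, int n, int cutoff)
{
    /* Simplification: small or odd dimensions use the classical algorithm. */
    if (n <= cutoff || n % 2 != 0) {
        mul_classical(C, cs, A, as, B, bs, n);
        return;
    }

    int h = n / 2;
    size_t hh = (size_t) h * h;
    uint64_t *W = malloc(9 * hh * sizeof(uint64_t));
    assert(W != NULL);
    uint64_t *S = W, *T = W + hh, *M = W + 2 * hh;   /* M holds M1..M7 */
    const uint64_t *A11 = A, *A12 = A + h, *A21 = A + h * as, *A22 = A21 + h;
    const uint64_t *B11 = B, *B12 = B + h, *B21 = B + h * bs, *B22 = B21 + h;

    blk_lin(S, A11, as, A22, as, h, 0); blk_lin(T, B11, bs, B22, bs, h, 0);
    mul_strassen(M, h, S, h, T, h, h, cutoff);              /* M1 */
    blk_lin(S, A21, as, A22, as, h, 0);
    mul_strassen(M + hh, h, S, h, B11, bs, h, cutoff);      /* M2 */
    blk_lin(T, B12, bs, B22, bs, h, 1);
    mul_strassen(M + 2 * hh, h, A11, as, T, h, h, cutoff);  /* M3 */
    blk_lin(T, B21, bs, B11, bs, h, 1);
    mul_strassen(M + 3 * hh, h, A22, as, T, h, h, cutoff);  /* M4 */
    blk_lin(S, A11, as, A12, as, h, 0);
    mul_strassen(M + 4 * hh, h, S, h, B22, bs, h, cutoff);  /* M5 */
    blk_lin(S, A21, as, A11, as, h, 1); blk_lin(T, B11, bs, B12, bs, h, 0);
    mul_strassen(M + 5 * hh, h, S, h, T, h, h, cutoff);     /* M6 */
    blk_lin(S, A12, as, A22, as, h, 1); blk_lin(T, B21, bs, B22, bs, h, 0);
    mul_strassen(M + 6 * hh, h, S, h, T, h, h, cutoff);     /* M7 */

    /* Combine the seven products into the four quadrants of C. */
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            size_t k = (size_t) i * h + j;
            uint64_t m1 = M[k], m2 = M[hh + k], m3 = M[2 * hh + k],
                     m4 = M[3 * hh + k], m5 = M[4 * hh + k],
                     m6 = M[5 * hh + k], m7 = M[6 * hh + k];
            C[i * cs + j]           = m1 + m4 - m5 + m7;
            C[i * cs + j + h]       = m3 + m5;
            C[(i + h) * cs + j]     = m2 + m4;
            C[(i + h) * cs + j + h] = m1 - m2 + m3 + m6;
        }
    free(W);
}
```

Because the sketch works in the ring of integers modulo 2^64, the Strassen result agrees exactly with the classical one. A real fixed-point implementation also has to account for truncation and the growth of intermediate sums, which this sketch ignores. The `cutoff` value plays the role of a tuning parameter: below it, the classical loop with its inlined dot product is used.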

I will post some more notes about possible improvements in a followup issue.

Speedup for nfloat_mat_mul (with uniform matrices) in this PR:

prec\n    2     3     4     8    16    24    32    48    64    80    96   128   144   256   512  1024
   64  1.13  1.04  1.05  1.04  1.03  2.05  1.30  1.08  0.98  1.25  1.04  1.01  1.01  1.01  1.04  1.00 
  128  1.06  1.02  1.00  1.02  0.98  1.51  1.20  1.03  0.99  1.06  1.00  0.98  0.99  1.00  1.00  0.99 
  192  0.99  1.01  0.99  1.29  1.48  1.82  1.44  1.38  1.34  1.40  1.07  1.01  0.98  1.00  0.99 
  256  1.01  1.01  1.01  1.04  1.11  1.30  1.18  1.16  1.19  1.24  0.99  0.99  1.00  0.99  0.99 
  320  1.03  1.08  1.03  1.10  1.03  1.40  1.31  1.26  1.32  1.33  1.13  1.05  0.98  1.02  1.00 
  384  1.02  1.00  1.04  1.08  0.99  1.23  1.24  1.21  1.26  1.30  1.13  1.03  1.01  1.01  1.01 
  448  1.05  0.98  1.02  1.07  0.94  1.10  1.13  1.07  1.15  1.17  1.11  1.06  1.03  1.02  1.00 
  512  0.98  0.95  0.99  0.97  0.99  1.00  1.00  1.02  1.07  1.10  1.11  1.07  1.06  1.02  1.00 
  576  0.99  0.99  0.99  0.98  1.03  1.01  1.03  1.08  1.13  1.18  0.97  0.96  0.93  1.04  1.01 
  640  0.99  1.02  1.00  1.03  1.02  1.03  1.05  1.08  1.15  1.10  1.02  0.93  0.96  1.02  1.02 
  704  0.99  0.99  0.97  1.00  1.00  1.00  1.03  1.06  1.13  1.17  1.06  1.00  0.98  1.04  0.99 
  768  0.97  0.99  0.98  0.98  1.00  0.99  1.01  1.05  1.12  1.16  1.07  1.02  0.99  1.04  1.01 
  832  1.00  1.01  0.98  0.99  1.02  0.99  1.04  1.05  1.17  1.17  1.06  1.02  0.98  1.03 
  896  1.00  0.98  1.03  1.00  1.00  1.00  1.05  1.07  1.15  1.19  1.14  1.06  1.04  1.04 
  960  0.99  0.99  0.99  1.01  0.99  1.00  1.02  1.07  1.19  1.17  1.17  1.07  1.07  1.02 
 1024  1.00  0.94  0.98  0.99  0.99  0.99  1.02  1.06  1.16  1.18  1.18  1.14  1.06  0.99 
 1536  0.99  0.99  0.99  0.99  1.00  0.99  1.05  1.09  1.17  1.19  1.15  1.05  0.99  1.00 
 2048  0.97  0.99  0.99  0.99  0.99  0.99  1.06  1.13  1.16  1.18  1.11  1.05  1.02  1.01 
 2560  0.99  0.99  0.99  0.99  0.99  0.99  1.05  1.08  1.17  1.17  1.05  1.00  1.00 
 3072  1.00  0.99  0.99  0.98  0.98  0.99  1.06  1.08  1.17  1.20  0.99  0.99 
 3584  0.99  0.98  0.99  0.99  0.99  0.99  1.06  1.07  1.19  1.09  1.02  0.99 
 4096  0.97  0.94  0.95  0.93  0.94  0.94  1.01  1.03  1.09  1.14  0.98  0.97

Speedup of nfloat_complex_mat_mul:

prec\n    2     3     4     8    16    24    32    48    64    80    96   128   144   256   512  1024
   64  0.95  0.98  1.00  1.13  1.40  1.15  1.16  1.18  1.24  1.00  1.02  1.05  0.98  0.97  1.00  1.01 
  128  0.98  1.00  1.02  1.17  1.37  1.23  1.20  1.23  1.26  1.21  1.17  1.15  1.24  1.02  1.06 
  192  0.96  0.96  1.15  1.46  1.61  1.49  1.42  1.45  1.46  1.22  1.20  1.16  1.20  1.06  1.08 
  256  0.99  1.02  1.11  1.13  1.29  1.19  1.21  1.23  1.31  1.09  1.08  1.15  1.23  1.24  1.06 
  320  1.02  1.04  1.06  1.11  1.35  1.20  1.20  1.23  1.31  1.36  1.36  1.48  1.48  1.05  1.07 
  384  0.99  0.99  1.04  1.08  1.34  1.22  1.20  1.23  1.30  1.29  1.35  1.41  1.40  1.11  1.07 
  448  0.98  0.94  0.98  1.05  1.25  1.11  1.11  1.10  1.18  1.20  1.22  1.31  1.31  1.08  1.08 
  512  1.00  1.00  1.00  0.99  0.98  0.99  0.99  1.03  1.06  1.12  1.14  1.20  1.21  1.12  1.07 
  576  0.98  0.99  0.99  1.04  1.02  1.03  1.03  1.08  1.14  1.18  1.21  1.17  1.17  1.15  1.06 
  640  1.02  0.99  1.02  1.02  1.03  1.02  1.05  1.10  1.14  1.19  1.23  1.21  1.18  1.13 
  704  0.99  0.96  0.99  0.99  1.01  0.98  1.01  1.08  1.12  1.17  1.19  1.27  1.19  1.10 
  768  1.02  0.96  1.01  1.02  0.97  1.00  1.03  1.07  1.12  1.17  1.21  1.27  1.20  1.06 
  832  1.01  0.96  1.02  1.00  1.00  0.97  1.04  1.10  1.16  1.19  1.22  1.24  1.18  1.07 
  896  1.01  0.98  1.01  1.00  1.01  0.99  1.05  1.09  1.14  1.18  1.22  1.28  1.21  1.04 
  960  1.00  0.97  1.01  1.01  1.00  1.00  1.04  1.10  1.14  1.19  1.20  1.31  1.21  1.03 
 1024  1.01  0.98  1.00  1.00  0.98  1.00  1.04  1.09  1.16  1.19  1.22  1.28  1.19  1.01 
 1536  1.00  1.01  0.99  1.00  0.99  1.00  1.06  1.09  1.17  1.21  1.22  1.14  1.07  1.01 
 2048  1.00  0.99  1.00  1.00  1.00  1.01  1.06  1.09  1.15  1.18  1.20  1.02  1.03  1.02 
 2560  1.00  1.00  0.99  1.00  1.01  1.01  1.06  1.10  1.17  1.21  1.21  1.02  1.02 
 3072  0.99  1.00  1.00  1.00  1.00  0.99  1.07  1.09  1.18  1.19  1.20  0.99 
 3584  1.01  1.00  0.99  0.99  1.00  0.99  1.06  1.08  1.18  1.09  0.99  1.00 
 4096  0.99  0.95  0.96  0.95  0.96  0.96  1.07  1.09  1.18  0.99  1.00  1.00

fredrik-johansson added 16 commits August 19, 2024 16:14

some more test code

d986efd

define add_sssssss... everywhere

3beb2d7

msc fix

80df284

test code

20b28e8

try to fix 32-bit x86

d238c29

matrix multiplication wip

4676718

bug fixes and optimizations

e429166

more matrix multiplication cleanup and improvements

9293b5c

retune real matrix mul

d247cc1

modified tunings

e591c83

tuning improvements

2ab1b05

Merge remote-tracking branch 'flintlib/main' into matmul

5efbeb2

tuning fix

6c41433

profiling code tweak

016cb3e

test code tweak

2fd3799

_nfixed_mat_mul_strassen: avoid repeated operations in odd dimension; call classical directly

2737830

fredrik-johansson merged commit 6c38679 into flintlib:main Sep 6, 2024
13 checks passed
