Fixed-point matrix multiplication improvements #2062
Merged
We define `add_ss*`/`sub_dd*` to exist all the way up to 8 limbs for all architectures. Note that this is a bit iffy, as the C fallbacks might produce poor assembly depending on the compiler and the inline asm versions can exhaust the register allocation. On x86-32 I had to switch to C fallbacks for the longer macros. I think with 7-8 operands failures are possible on x86-64 too depending on how the macros are used, though the uses in the current codebase don't seem to hit this limit. All of this trouble because compilers don't understand carry flags, sigh.

Fixed-point matrix multiplication is optimized for medium size matrices by using dot products with inlined code and by incorporating Strassen multiplication, along with new tuning values. The internal `nfixed` representation is also made semi-public.

I will post some more notes about possible improvements in a followup issue.
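
For readers unfamiliar with the carry-flag complaint above, here is a minimal sketch of what a portable C fallback for a multi-limb addition has to do. This is not the code from this PR: `add_limbs_portable` is a hypothetical helper written as a loop, whereas the `add_ss*` macros are fixed-length and unrolled. The point is only that each carry must be recovered with comparisons, which a compiler may or may not turn back into an add-with-carry chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the library: res = a + b over n 64-bit
   limbs, least significant limb first. Returns the carry out of the top limb. */
static uint64_t add_limbs_portable(uint64_t *res, const uint64_t *a,
                                   const uint64_t *b, size_t n)
{
    uint64_t carry = 0;

    assert(n >= 1);

    for (size_t i = 0; i < n; i++)
    {
        uint64_t s = a[i] + carry;   /* can only wrap when carry == 1 */
        uint64_t c1 = (s < carry);   /* carry produced by adding the incoming carry */
        uint64_t t = s + b[i];
        uint64_t c2 = (t < s);       /* carry produced by adding b[i] */

        res[i] = t;
        carry = c1 | c2;             /* c1 and c2 are never both set */
    }

    return carry;
}
```

With inline asm the same operation is a single add followed by add-with-carry instructions, but then every input and output limb has to be live in a register at once, which is presumably where the register pressure with 7-8 operands comes from.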
Speedup for `nfloat_mat_mul` (with uniform matrices) in this PR:

Speedup of `nfloat_complex_mat_mul`: