Matrix multiplication for 64-bit floats and all complex types on the CPU currently goes through Eigen. That leaves performance on the table, particularly for complex types and for Zen chips. Normally you can coax Eigen into using MKL (the original MKL, not oneDNN), but that isn't possible for the Eigen CXX11 tensor library that XLA uses.
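For concreteness, here is a minimal JAX snippet (sizes and dtypes are just illustrative) that exercises the path in question: a complex128 matmul on CPU, which today lowers to Eigen's gemm through XLA.

```python
# Minimal repro of the path under discussion: complex128 matmul on CPU.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # enable float64/complex128

key_r, key_i = jax.random.split(jax.random.PRNGKey(0))
n = 1024
a = (jax.random.normal(key_r, (n, n), dtype=jnp.float64)
     + 1j * jax.random.normal(key_i, (n, n), dtype=jnp.float64))

matmul = jax.jit(jnp.matmul)
c = matmul(a, a.conj().T).block_until_ready()  # zgemm-shaped work on CPU
```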
I think it would be reasonable to use BLIS as an alternative CPU backend, primarily for the sake of complex floating-point performance. BLIS performs well on Intel, AMD, and ARM CPUs alike.
See the benchmark on the Zen 1 architecture: [benchmark figure]
There are practical use cases that benefit from fast complex matmul on the CPU. At least one quantum chemistry code uses JAX for AD, and most users of that package probably don't have a GPU with good FP64 throughput.
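As a sketch of what that workload looks like (the function and shapes here are made up), differentiating a real-valued loss through a complex matmul spends essentially all of its time in zgemm-shaped calls, on both the forward and backward passes:

```python
# Illustrative only: gradient of a real loss through complex128 matmuls,
# the kind of AD-on-CPU workload a quantum chemistry code might run.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def loss(h, v):
    # Real scalar output; JAX can differentiate real-valued functions
    # of complex inputs.
    return jnp.sum(jnp.abs(h @ v) ** 2)

n = 256
h = jnp.eye(n, dtype=jnp.complex128)
v = jnp.full((n, n), 1.0 / n, dtype=jnp.complex128)
g = jax.jit(jax.grad(loss))(h, v)  # backward pass is more complex matmuls
```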
I've started to implement a small piece of this and want to know whether anyone else is interested. Unfortunately, I'm not experienced with Bazel, so any advice on that front would be appreciated :)
Additional reasons why this isn't a bad idea:
- BLIS can be statically linked.
- Unlike Eigen, BLIS detects the CPU at runtime and picks a fast kernel, so a single binary wheel can get near-MKL performance.
- Multithreading (pthreads) support is quite flexible, so you can stay picky about threading without linking OpenMP.
- The API supports arbitrary row and column strides (see the sketch after this list).
- 3-clause BSD license.
- It should see regular maintenance for a while, since AMD relies on it.
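To illustrate the stride point above, here is a hypothetical ctypes sketch of BLIS's typed `bli_zgemm`, which takes independent row and column strides for every operand. The library soname, the `trans_t` encoding, and 64-bit `dim_t`/`inc_t` are my assumptions about a stock 64-bit BLIS build; this is an illustration, not a tested binding.

```python
# Hypothetical sketch: driving BLIS's typed API from Python via ctypes.
import ctypes
import numpy as np

blis = ctypes.CDLL("libblis.so")   # assumed soname/location
BLIS_NO_TRANSPOSE = 0              # assumed trans_t value

def zgemm_strided(a, b, c, alpha=1 + 0j, beta=0 + 0j):
    """C := alpha*A@B + beta*C for complex128 arrays with arbitrary strides."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and c.shape == (m, n)
    al = np.asarray(alpha, dtype=np.complex128)
    be = np.asarray(beta, dtype=np.complex128)
    i64 = ctypes.c_int64           # assumes BLIS built with 64-bit ints
    es = lambda x, ax: i64(x.strides[ax] // x.itemsize)  # bytes -> elements
    ptr = lambda x: x.ctypes.data_as(ctypes.c_void_p)
    blis.bli_zgemm(
        ctypes.c_int(BLIS_NO_TRANSPOSE), ctypes.c_int(BLIS_NO_TRANSPOSE),
        i64(m), i64(n), i64(k),
        ptr(al), ptr(a), es(a, 0), es(a, 1),
                 ptr(b), es(b, 0), es(b, 1),
        ptr(be), ptr(c), es(c, 0), es(c, 1),
    )

# Transposed or sliced NumPy views go straight in, no copy needed:
a = np.random.rand(512, 512) + 1j * np.random.rand(512, 512)
b = a.T                            # column-major view, just swapped strides
c = np.zeros((512, 512), dtype=np.complex128)
zgemm_strided(a, b, c)
```

Because each operand carries its own (row, column) stride pair, callers never have to pack or transpose buffers themselves, which maps nicely onto layout-carrying buffers like XLA's.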