Matrix multiplication for 64-bit floats and all complex types on the CPU currently goes through Eigen. That leaves performance on the table, particularly for complex types and for Zen chips. Normally you can coax Eigen into using MKL (the original MKL, not oneDNN), but that isn't possible for the Eigen CXX11 tensor library that XLA uses.
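For concreteness, here is a minimal JAX snippet (sizes and dtypes are just illustrative) that exercises the path in question: a complex128 matmul on CPU, which today lowers to Eigen's gemm through XLA.

```python
# Minimal repro of the path under discussion: complex128 matmul on CPU.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # enable float64/complex128

key_r, key_i = jax.random.split(jax.random.PRNGKey(0))
n = 1024
a = (jax.random.normal(key_r, (n, n), dtype=jnp.float64)
     + 1j * jax.random.normal(key_i, (n, n), dtype=jnp.float64))

matmul = jax.jit(jnp.matmul)
c = matmul(a, a.conj().T).block_until_ready()  # zgemm-shaped work on CPU
```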
I think it would be reasonable to use BLIS as an alternative CPU backend, primarily for the sake of complex floating-point performance. BLIS performs well on Intel, AMD, and ARM CPUs alike.
See the benchmark on the Zen 1 architecture: [benchmark figure]
There are practical use cases that benefit from fast complex matmul on the CPU. At least one quantum chemistry code uses JAX for AD, and most users of that package probably don't have a GPU with good FP64 throughput.
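As a sketch of what that workload looks like (the function and shapes here are made up), differentiating a real-valued loss through a complex matmul spends essentially all of its time in zgemm-shaped calls, on both the forward and backward passes:

```python
# Illustrative only: gradient of a real loss through complex128 matmuls,
# the kind of AD-on-CPU workload a quantum chemistry code might run.
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)

def loss(h, v):
    # Real scalar output; JAX can differentiate real-valued functions
    # of complex inputs.
    return jnp.sum(jnp.abs(h @ v) ** 2)

n = 256
h = jnp.eye(n, dtype=jnp.complex128)
v = jnp.full((n, n), 1.0 / n, dtype=jnp.complex128)
g = jax.jit(jax.grad(loss))(h, v)  # backward pass is more complex matmuls
```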
I've started to implement a small piece of this and want to know whether anyone else is interested. Unfortunately, I'm not experienced with Bazel, so any advice on that front would be appreciated :)
Additional reasons why this isn't a bad idea:
- BLIS can be statically linked.
- Unlike Eigen, BLIS detects the CPU at runtime and picks a fast kernel, so a single binary wheel can get near-MKL performance.
- Multithreading (pthreads) support is quite flexible, so you can stay picky about threading without linking OpenMP.
- The API supports arbitrary row and column strides (see the sketch after this list).
- 3-clause BSD license.
- It should see regular maintenance for a while, since AMD relies on it.
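To illustrate the stride point above, here is a hypothetical ctypes sketch of BLIS's typed `bli_zgemm`, which takes independent row and column strides for every operand. The library soname, the `trans_t` encoding, and 64-bit `dim_t`/`inc_t` are my assumptions about a stock 64-bit BLIS build; this is an illustration, not a tested binding.

```python
# Hypothetical sketch: driving BLIS's typed API from Python via ctypes.
import ctypes
import numpy as np

blis = ctypes.CDLL("libblis.so")   # assumed soname/location
BLIS_NO_TRANSPOSE = 0              # assumed trans_t value

def zgemm_strided(a, b, c, alpha=1 + 0j, beta=0 + 0j):
    """C := alpha*A@B + beta*C for complex128 arrays with arbitrary strides."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and c.shape == (m, n)
    al = np.asarray(alpha, dtype=np.complex128)
    be = np.asarray(beta, dtype=np.complex128)
    i64 = ctypes.c_int64           # assumes BLIS built with 64-bit ints
    es = lambda x, ax: i64(x.strides[ax] // x.itemsize)  # bytes -> elements
    ptr = lambda x: x.ctypes.data_as(ctypes.c_void_p)
    blis.bli_zgemm(
        ctypes.c_int(BLIS_NO_TRANSPOSE), ctypes.c_int(BLIS_NO_TRANSPOSE),
        i64(m), i64(n), i64(k),
        ptr(al), ptr(a), es(a, 0), es(a, 1),
                 ptr(b), es(b, 0), es(b, 1),
        ptr(be), ptr(c), es(c, 0), es(c, 1),
    )

# Transposed or sliced NumPy views go straight in, no copy needed:
a = np.random.rand(512, 512) + 1j * np.random.rand(512, 512)
b = a.T                            # column-major view, just swapped strides
c = np.zeros((512, 512), dtype=np.complex128)
zgemm_strided(a, b, c)
```

Because each operand carries its own (row, column) stride pair, callers never have to pack or transpose buffers themselves, which maps nicely onto layout-carrying buffers like XLA's.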