OpenBLAS on aarch substantially slower than other BLAS flavours #160

Open

h-vetinari opened this issue Jun 14, 2024 · 26 comments

@h-vetinari
Member

In the context of the BLAS variant testing for scipy, I noticed that the OpenBLAS runs were much slower. I did some basic timing comparisons based on what pytest reports as the overall runtime of the scipy test suite (the timing also depends quite a bit on whether the agent has AVX512F/AVX512CD or not, and this is random across Azure's fleet, so I'm taking the average across implementation & CPU type):

| Flavour  | Avg. (AVX2) | #runs | Avg. (AVX512) | #runs |
|----------|-------------|-------|---------------|-------|
| blis     | 00:26:10.0  | 2     | 00:18:28.2    | 6     |
| mkl      | 00:24:35.4  | 5     | 00:18:56.4    | 5     |
| netlib   | 00:25:44.7  | 3     | 00:19:03.0    | 5     |
| openblas | 01:34:15.0  | 2     | 01:24:06.8    | 4     |

Overall, OpenBLAS ends up being 3-5x slower than all the other BLAS/LAPACK implementations, which to me indicates that something is going very wrong somewhere.

CC @martin-frbg @rgommers

@martin-frbg

All these numbers look a little strange to me - why would MKL be only marginally better than the entirely unoptimized reference implementation? And I'd expect even the baseline OpenBLAS - an MSVC build that can only make use of the generic C sources - to be slightly faster than the reference as well.

@h-vetinari
Member Author

h-vetinari commented Jun 14, 2024

> why would MKL be only marginally better than the entirely unoptimized reference implementation?

The answer is almost certainly that the cumulative runtime of BLAS/LAPACK calls, as a portion of the total runtime of the entire scipy test suite, gets dominated by random variability of the CI agent (memory and CPU contention, caches, etc.).

Under this hypothesis, only when something takes way longer on average in the blas/lapack interface does it actually make a difference.

(The difference between CI agents is likely also determined by aspects other than just the available CPU instructions - numpy does runtime dispatching, but none of our BLAS/LAPACK builds have been compiled to take advantage of AVX instructions, for example.)
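
For illustration, a minimal sketch (not from this thread) of timing a GEMM-heavy call directly, which separates the BLAS library itself from test-suite and CI-agent noise; the matrix size and repeat count are arbitrary choices:

```python
# Rough sketch: time a dense matmul directly so that BLAS performance is not
# drowned out by unrelated test-suite / CI-agent variability.
import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
b = rng.standard_normal((2000, 2000))

timings = []
for _ in range(5):
    t0 = time.perf_counter()
    _ = a @ b  # dispatches to the linked BLAS implementation (dgemm)
    timings.append(time.perf_counter() - t0)

print(f"best of 5: {min(timings):.3f} s")  # best-of-N reduces contention noise
```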

@rgommers

Sounds like the issue in OpenMathLib/OpenBLAS#4582. Was the openblas package in conda-forge rebuilt with the fix included?

@h-vetinari
Member Author

> Sounds like the issue in OpenMathLib/OpenBLAS#4582. Was the openblas package in conda-forge rebuilt with the fix included?

That issue was fixed in OpenMathLib/OpenBLAS#4587 which landed in 0.3.27 (which is the version we're using across conda-forge already).

@rgommers

Hmm, that needs investigating then. The timings make it almost certain to be a similar deadlock or lock contention issue. It would take a 10x-100x slowdown in BLAS/LAPACK calls to explain the test suite being that slow.

There was a similar issue at scipy/scipy#20585 (comment).

@martin-frbg

There have been no other changes to the Windows thread server code in 0.3.27 since then (as far as I am aware right now), so I think the best option for testing would be to swap in the old version of blas_server_win32.c from OpenMathLib/OpenBLAS@66904f8 (this will probably need adding back the global declaration of `int blas_omp_threads_local = 1;`, but hopefully no other changes). Unfortunately, I do not have access to a Windows machine where I am right now.

@martin-frbg

(though I would assume that any residual problem in mseminatore's PRs should have come up when testing the fix in PR4587)

@martin-frbg

The test case from OpenMathLib/OpenBLAS#4582 passes for me without any apparent delays, so I don't think it's that.

@martin-frbg

It also seems to me that the "similar" issue is/was plagued by some kind of infighting between duplicate libraries?

@h-vetinari
Member Author

> It also seems to me that the "similar" issue is/was plagued by some kind of infighting between duplicate libraries?

That's an interesting conjecture, though I don't see how that can happen in conda-forge, where we generally take care to unvendor things so there's only one copy - especially for numpy and scipy, where we keep a close eye on the builds.

@isuruf
Member

isuruf commented Jun 22, 2024

Can you post a link to the logs?

@h-vetinari
Member Author

> Can you post a link to the logs?

Here you go (plus any other still remaining runs of TEST: 1.13.x + blas variants on the scipy feedstock, before azure deletes them after a month).

@martin-frbg

First impression is that linalg/tests/test_basic.py::TestLstsq::test_random_complex_exact appears to be taking unusually long, and several runs with what look like markedly shorter times simply aborted due to some missing dependencies in the Python framework.

@h-vetinari
Member Author

> and several runs with what look like markedly shorter times simply aborted due to some missing dependencies in the Python framework.

In those PRs you cannot just look at the overall runtime, because we're trying to forcefully make a distinction between CPUs with/without AVX512F/AVX512CD, and the jobs abort if the CPU architecture expectation isn't met (because some past & present failures had different behaviours, and Azure doesn't provide a way to influence which agent you get). However, if that happens, then there's zero runtime of the test suite.

On Windows, roughly everything over 60min actually ran the tests, everything under 50min didn't, and in between it depends on circumstances (fast or slow agent).
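
(For readers: a hypothetical sketch of such a gate, using numpy's runtime CPU feature detection; the actual feedstock logic is not shown in this thread, and the command-line flag below is made up for illustration.)

```python
# Hypothetical sketch of a CPU gate like the one described above; the module
# path of __cpu_features__ differs between numpy 1.x and 2.x.
import sys

try:
    from numpy._core._multiarray_umath import __cpu_features__  # numpy >= 2.0
except ImportError:
    from numpy.core._multiarray_umath import __cpu_features__   # numpy 1.x

has_avx512 = bool(__cpu_features__.get("AVX512F")) and bool(__cpu_features__.get("AVX512CD"))
expect_avx512 = "--expect-avx512" in sys.argv  # made-up flag for illustration

if has_avx512 != expect_avx512:
    # Abort early so runs on the "wrong" agent don't produce misleading timings.
    sys.exit("CPU architecture expectation not met; aborting test run")
```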

@h-vetinari
Member Author

> First impression is that it appears to be [...] that is taking unusually long

It's quite unlikely that a single test (or anything less than widespread/systemic) can blow out the test times like that, especially as many tests are completely unrelated to BLAS. Tests also have a 20min timeout, and none of them hits it.

@martin-frbg

Umm, did you ever get around to testing with the pre-0.3.26 blas_server_win32.c as per #160 (comment)?

@h-vetinari
Member Author

Thanks for the reminder. I didn't understand the ask at the time, but I've dug a bit into the git history and I think/hope I got it right; I'm currently building an openblas version with this in #162. We might already see a difference in the blas testing on this feedstock (compare "Good news #2" from here), and if that's not conclusive, we can publish those builds to a separate label to test them in SciPy.

@martin-frbg

Thank you - meanwhile it looks like I might have to revert that PR anyway due to the sunpy thread-safety issue that came up on numpy :(

@h-vetinari
Member Author

h-vetinari commented Dec 10, 2024

FWIW, this pattern is still occurring in a slightly different form with current OpenBLAS 0.3.28 -- Windows builds have pretty homogeneous times now, but the aarch builds with OpenBLAS are taking substantially longer than with netlib (interestingly, neither implementation is affected on PPC).

Both aarch and ppc get emulated through QEMU. We recently upgraded to the newest QEMU 9.1.2, which could play a role; just thought I'd mention it here. Taking the median run (out of 5 each; for more details see the logs), times in HH:MM:SS:

| Flavour  | aarch    | ppc      |
|----------|----------|----------|
| netlib   | 01:38:53 | 01:40:58 |
| openblas | 04:14:40 | 01:33:32 |

(There are also some failures specific to aarch+openblas, so this might just be an unhappy combination for now.)

@martin-frbg

This could be related to unprofitable forwarding of GEMM calls to GEMV (OpenMathLib/OpenBLAS#4951, already fixed on the develop branch), though I'd not expect that to be worse than the netlib implementation. It is the only major regression known to affect aarch64 targets - absurdly, a lot of time went into chasing down an elusive slowdown on Neoverse V2 that could have been better spent pushing the 0.3.29 release.

I don't quite see where you're getting all the Cholesky errors from - what is your qemu emulating here, ARMV8 or ARMV8SVE?
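
(For context, a sketch of the call shape involved in GEMM-to-GEMV forwarding; this is an illustration under the assumption that the forwarding triggers when one GEMM dimension is 1, not a confirmed reproducer for OpenMathLib/OpenBLAS#4951.)

```python
# A GEMM call where B has a single column is mathematically a GEMV, and the
# library may forward it to the GEMV kernel internally.
import numpy as np
from scipy.linalg.blas import dgemm, dgemv

rng = np.random.default_rng(0)
a = np.asfortranarray(rng.standard_normal((1000, 1000)))
b = np.asfortranarray(rng.standard_normal((1000, 1)))  # degenerate "matrix"

c_gemm = dgemm(1.0, a, b)        # candidate for internal GEMM->GEMV forwarding
c_gemv = dgemv(1.0, a, b[:, 0])  # the equivalent explicit GEMV call

assert np.allclose(c_gemm[:, 0], c_gemv)
```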

@h-vetinari
Member Author

> what is your qemu emulating here, ARMV8 or ARMV8SVE?

We take qemu-user-static from Fedora; I'm not aware that we're overriding their defaults (nor what they are). Perhaps @isuruf knows?

@martin-frbg

Ok, I think qemu-user is providing "generic" emulation only, so probably ARMV8. I'm currently trying to run the scipy 1.14.1 test suite on an Ampere Altra in the GCC Compile Farm to check whether I can find any indication of a recent slowdown.

@martin-frbg

martin-frbg commented Dec 11, 2024

Did a few scipy_test runs with a pip-installed scipy 1.14.1 on the 64-core Ampere Altra in the GCC Compile Farm (ARMV8 target, Debian Linux, gcc-14.2); so far I have not seen any indication of a marked slowdown with recent builds.

| library version               | user (s) | system (s) | elapsed (min:sec) |
|-------------------------------|----------|------------|-------------------|
| current 0.3.28.dev            | 2384.5   | 106.41     | 16:49.26          |
| Oct 3 0.3.28.dev (624e9...)   | 2364.01  | 95.90      | 16:42.31          |
| 0.3.28                        | 2395.74  | 102.25     | 16:47.68          |
| stock scipy (0.3.27-a7b41fb3) | 2399.01  | 103.98     | 16:49.26          |
| 0.3.26                        | 2572.36  | 104.86     | 16:48.94          |
| 0.3.25                        | 2647.09  | 108.65     | 16:53.63          |
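
(The user/system/elapsed breakdown suggests these runs were wrapped in a time(1)-style command; a rough Python-only approximation for reproducing such a measurement, assuming a POSIX system, would be:)

```python
# Rough sketch (assumption: the numbers above come from a time(1)-style wrapper
# around the scipy test suite). The `resource` module is POSIX-only.
import resource
import time

import scipy

t0 = time.perf_counter()
scipy.test()  # runs the default ("fast") scipy test suite via pytest
elapsed = time.perf_counter() - t0

usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"user {usage.ru_utime:.2f}  system {usage.ru_stime:.2f}  elapsed {elapsed:.2f}s")
```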

@h-vetinari
Member Author

Thanks for the testing! Glad to hear that bare-metal performance looks much better than the emulated one!

FWIW, this is what numpy CPU feature detection thinks is present in the aarch-on-QEMU emulation:

  "SIMD Extensions": {
    "baseline": [
      "NEON",
      "NEON_FP16",
      "NEON_VFPV4",
      "ASIMD"
    ],
    "found": [
      "ASIMDHP",
      "ASIMDFHM",
      "SVE"
    ]
  }
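
(A sketch of how such a report can be obtained, assuming numpy >= 1.24, which provides show_runtime; the exact command used here is not shown in the thread.)

```python
# Prints the SIMD extensions report (baseline / found / not available) plus
# threading info for the active BLAS, as detected at runtime.
import numpy as np

np.show_runtime()
```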

@martin-frbg

Interesting that it sees SVE, I did not expect that. Could be that this is what is slowing it down, depending on how good the emulation is - whether it is translating into AVX512 or simple serial code. (That is, if you are running the tests without forcing OPENBLAS_CORETYPE to ARMV8.)
There is an open ticket for a claimed slowdown in 0.3.28 that the reporter said he bisected to the introduction of a separate SVE S/DGEMM kernel for small matrix sizes, but there is no reproducer and no response to further questions.
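
(A sketch of forcing the non-SVE code path as suggested above: OPENBLAS_CORETYPE must be set before libopenblas is loaded, i.e. before importing numpy/scipy. This assumes a DYNAMIC_ARCH build such as the conda-forge one.)

```python
import os

# Pin OpenBLAS's runtime dispatch to the plain ARMV8 kernels (skips SVE).
os.environ["OPENBLAS_CORETYPE"] = "ARMV8"

import scipy  # noqa: E402  (imported after setting the environment on purpose)

scipy.test()
```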

@h-vetinari h-vetinari changed the title OpenBLAS on windows substantially slower than other BLAS flavours OpenBLAS on aarch substantially slower than other BLAS flavours Dec 15, 2024
@martin-frbg

martin-frbg commented Dec 18, 2024

I have now run a series of scipy_test runs for all releases since 0.3.23, plus a number of commit hashes from recent development, using my Pixel 8 Pro with TARGET=NEOVERSEV1 as a low-end SVE machine. Runtime turned out to be fairly stable between them, with considerable jitter between individual runs of the same build (despite having the phone on the charger and no other foreground apps open besides termux).
