With #20, the parallel schedule seems to scale perfectly on many cores:
$ OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 ./build/gemm_f32_serial
Warmup: 0.9036 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
OpenBLAS benchmark
Collected 10 samples in 1.238 seconds
Average time: 123.713 ms
Stddev time: 0.444 ms
Min time: 123.335 ms
Max time: 124.890 ms
Perf: 114.425 GFLOP/s
Laser production implementation
Collected 10 samples in 1.465 seconds
Average time: 146.392 ms
Stddev time: 0.644 ms
Min time: 146.006 ms
Max time: 147.802 ms
Perf: 96.697 GFLOP/s
Mean Relative Error compared to OpenBLAS: 1.243059557509696e-07
------------------------------------------------------------
$ ./build/gemm_f32_omp
Warmup: 0.9021 s, result 224 (displayed to avoid compiler optimizing warmup away)
A matrix shape: (M: 1920, N: 1920)
B matrix shape: (M: 1920, N: 1920)
Output shape: (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes: 29.491 MB
Arithmetic intensity: 480.000 FLOP/byte
Theoretical peak single-core: 230.400 GFLOP/s
Theoretical peak multi: 4147.200 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.
OpenBLAS benchmark
Collected 10 samples in 0.079 seconds
Average time: 7.739 ms
Stddev time: 4.368 ms
Min time: 6.020 ms
Max time: 20.097 ms
Perf: 1829.200 GFLOP/s
Laser production implementation
Collected 10 samples in 0.083 seconds
Average time: 8.126 ms
Stddev time: 4.777 ms
Min time: 6.241 ms
Max time: 21.632 ms
Perf: 1742.123 GFLOP/s
Mean Relative Error compared to OpenBLAS: 0.01456451416015625
For comparison, 96.7 GFLOP/s * 18 cores ≈ 1740 GFLOP/s on my machine.
However, the single-threaded implementation is still quite often below OpenBLAS.
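The figures in the two runs can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the benchmark: it assumes the usual 2*M*N*K operation count for GEMM, counts only the A and B inputs as required bytes (which is what reproduces the reported 29.491 MB), and recomputes throughput from the rounded average times, so the last digits may differ slightly from the printed Perf values. All variable names are made up for illustration.

```python
# Sizes and timings copied from the benchmark output in this issue.
M = N = K = 1920
BYTES_PER_F32 = 4
CORES = 18
PEAK_SINGLE_CORE_GFLOPS = 230.4  # "Theoretical peak single-core" as reported

# One multiply and one add per inner-product term.
flops = 2 * M * N * K
# Assumption: only the A and B inputs are counted, matching the reported 29.491 MB.
input_bytes = (M * K + K * N) * BYTES_PER_F32
intensity = flops / input_bytes


def gflops(seconds):
    """Throughput in GFLOP/s for one full matrix multiplication."""
    return flops / seconds / 1e9


laser_serial = gflops(146.392e-3)     # Laser, 1 thread, average time
openblas_serial = gflops(123.713e-3)  # OpenBLAS, 1 thread, average time
laser_parallel = gflops(8.126e-3)     # Laser, OpenMP, average time

speedup = laser_parallel / laser_serial
efficiency = speedup / CORES
serial_vs_openblas = laser_serial / openblas_serial
fraction_of_peak = laser_serial / PEAK_SINGLE_CORE_GFLOPS

print(f"operations: {flops / 1e6:.3f} millions")
print(f"arithmetic intensity: {intensity:.3f} FLOP/byte")
print(f"Laser serial: {laser_serial:.1f} GFLOP/s, parallel: {laser_parallel:.1f} GFLOP/s")
print(f"speedup: {speedup:.2f}x on {CORES} cores (efficiency {efficiency:.2f})")
print(f"Laser / OpenBLAS, single thread: {serial_vs_openblas:.2f}")
print(f"Laser single thread as fraction of peak: {fraction_of_peak:.2f}")
```

Note that the parallel runs have a standard deviation of more than 4 ms on an average of about 8 ms, so the computed efficiency should be read as approximate.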
Note that with AVX512, instructions such as FMA can take a broadcast memory operand directly, so no explicit broadcast is needed, which saves registers.
Unfortunately there is no way to ensure the compiler uses those; GCC fails to do so before GCC 9:
Causes:
It should be reintroduced.