On the following benchmark on my dual core machine with hyper-threading enabled:
https://github.com/numforge/laser/blob/d725651c0f8ca1da3f761e27e9846bc81b9341f3/benchmarks/loop_iteration/iter_bench_prod.nim (branch foreach-dsl #4)
I get the following score without OpenMP
Warmup: 1.1911 s, result 224 (displayed to avoid compiler optimizing warmup away)
Production implementation for tensor iteration - float64
Collected 1000 samples in 8.931 seconds
Average time: 8.930ms
Stddev time: 0.345ms
Min time: 8.723ms
Max time: 14.785ms
Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413
Production implementation for tensor iteration - float64
Collected 1000 samples in 29.597 seconds
Average time: 29.597ms
Stddev time: 1.125ms
Min time: 27.002ms
Max time: 38.606ms
Display output[[0,0]] to make sure it's not optimized away
1.143903810108473
and with OpenMP
Warmup: 1.1874 s, result 224 (displayed to avoid compiler optimizing warmup away)
Production implementation for tensor iteration - float64
Collected 1000 samples in 4.094 seconds
Average time: 4.092ms
Stddev time: 0.206ms
Min time: 3.897ms
Max time: 5.459ms
Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413
Production implementation for tensor iteration - float64
Collected 1000 samples in 24.025 seconds
Average time: 24.022ms
Stddev time: 1.127ms
Min time: 22.379ms
Max time: 33.763ms
Display output[[0,0]] to make sure it's not optimized away
1.143903810108473
Potential explanations:
hitting the maximum memory bandwidth:
Parallel strided iteration also means parallel cache misses.
One way to alleviate that would be to use GCC/Clang's __builtin_prefetch (see the first sketch after this list).
Note: this does not reflect the experiments in https://github.com/zy97140/omp-benchmark-for-pytorch. Furthermore, computing the next element's location shouldn't require any memory access.
Unfortunately, even if it doesn't require any memory access, the CPU cannot leverage instruction-level parallelism to execute the main computation at the same time as it computes the next element's location, because that computation branches.
accesses to shape and strides for strided iteration are slow:
This can happen if the compiler doesn't create local copies or keep them in registers because it cannot prove that nothing mutates shape/strides during iteration (see the second sketch after this list).
strided iteration is running out of registers and/or instruction cache space
Unlikely, as those are per-CPU and not per-socket; we should still see a linear increase.
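
As a minimal sketch of the prefetch idea, assuming a hand-written importc wrapper for GCC/Clang's __builtin_prefetch (the names and the strided-sum loop below are illustrative, not Laser's actual API):

```nim
# Sketch only: `prefetch` and `sumStrided` are illustrative names, not Laser's API.
proc prefetch(address: pointer) {.importc: "__builtin_prefetch", nodecl.}

proc sumStrided(buf: ptr UncheckedArray[float64]; len, stride: int): float64 =
  ## Sums `len` elements spaced `stride` apart; assumes `buf` holds at least
  ## `(len-1)*stride + 1` elements.
  var idx = 0
  for i in 0 ..< len:
    if i + 1 < len:
      # Hint the cache line the next iteration will read while the current
      # addition is still in flight.
      prefetch(addr buf[idx + stride])
    result += buf[idx]
    idx += stride
```

This only works with a GCC/Clang backend, since __builtin_prefetch is a compiler builtin; whether it actually helps here would have to be measured.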
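And a sketch of hoisting the metadata into locals, together with the branchy next-element bookkeeping mentioned above (MAXRANK and this Tensor layout are assumed for illustration, not Laser's actual definitions):

```nim
# Sketch only: `MAXRANK` and this `Tensor` layout are assumed for
# illustration, not Laser's actual definitions.
const MAXRANK = 7

type Tensor = object
  rank: int
  shape, strides: array[MAXRANK, int]
  data: seq[float64]

proc sumAll(t: Tensor): float64 =
  # Hoist metadata into locals: the compiler can keep them in registers
  # because it no longer has to assume that something aliasing `t` mutates
  # shape/strides mid-loop.
  let
    rank = t.rank
    shape = t.shape
    strides = t.strides
  var size = 1
  for k in 0 ..< rank:
    size *= shape[k]

  var coord: array[MAXRANK, int]  # current multi-dimensional coordinates
  var idx = 0                     # current linear offset into the data buffer
  for _ in 0 ..< size:
    result += t.data[idx]
    # Advance to the next element. The per-axis carry check is the branching
    # that prevents overlapping this bookkeeping with the main computation
    # through instruction-level parallelism.
    for k in countdown(rank - 1, 0):
      coord[k] += 1
      idx += strides[k]
      if coord[k] < shape[k]:
        break
      coord[k] = 0
      idx -= shape[k] * strides[k]
```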