
Use preferences to switch between SIMD and KernelAbstractions #133

Draft
vchuravy wants to merge 4 commits into master

Conversation

vchuravy

I was experimenting with using PrecompileTools on WaterLily, and the choice to dispatch to the SIMD backend depending on the nthreads variable caused issues.

  1. In current versions of Julia, nthreads is no longer a constant.
  2. If someone precompiles code, nthreads == 1 during the precompilation process, so the wrong code path gets exercised.

Opening this as a draft for now to solicit feedback. One would probably need to change the tests so that both code paths are tested.
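
As a concrete illustration, a Preferences.jl-based switch could look roughly like the sketch below when placed inside the package module; the preference key "backend" and the `set_backend` helper are illustrative names, not necessarily what this PR actually uses.

```julia
module BackendSelect   # stand-in for the package module (Preferences needs a module with a UUID)

using Preferences

# Read the preference at precompile time, so the chosen code path is baked
# into the precompile cache instead of depending on Threads.nthreads().
const backend = @load_preference("backend", "KernelAbstractions")

"""
    set_backend(new_backend)

Persist the backend choice ("SIMD" or "KernelAbstractions") in
LocalPreferences.toml; takes effect after restarting Julia.
"""
function set_backend(new_backend::String)
    new_backend in ("SIMD", "KernelAbstractions") ||
        throw(ArgumentError("backend must be \"SIMD\" or \"KernelAbstractions\""))
    @set_preferences!("backend" => new_backend)
    @info "Backend set to $new_backend; restart Julia for this to take effect."
end

end # module
```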

@b-fg
Member

b-fg commented Jun 21, 2024

Thanks for catching that. I was aware that nthreads==1 during precompilation was problematic, but during execution it was working as intended. Using Preferences seems like a nice workaround. I will do some tests and integrate it.

Also, not specifying the workgroup size did not yield a noticeable performance increase compared to 64 in the past (iirc). Has something changed in KA related to this? Is it in any case the recommended way to set up kernels?

@vchuravy
Author

Is it in any case the recommended way to set up kernels?

It is a bit tricky between CPU and GPU. Right now the KA backend on the CPU is rather slow since the base-case size is small; the CPU does much better with larger base cases. We don't currently have a way to calculate that base case automatically, so we use 1024 on the CPU as a default.

On the GPU a static base case is nice since it allows some of the index integer operations to be optimized away.
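
For context, here is a small self-contained KernelAbstractions example (not WaterLily code) contrasting a fixed workgroup size with letting KA choose one:

```julia
using KernelAbstractions

@kernel function scale!(a, s)
    I = @index(Global, Cartesian)
    @inbounds a[I] *= s
end

a = rand(Float32, 64, 64, 64)
backend = CPU()               # or CUDABackend(), ROCBackend(), ...

# Fixed workgroup size, as WaterLily has done so far with 64:
scale!(backend, 64)(a, 2f0, ndrange = size(a))

# No workgroup size, as in this PR: KA falls back to its own default
# (the behaviour discussed in this thread):
scale!(backend)(a, 2f0, ndrange = size(a))

KernelAbstractions.synchronize(backend)
```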

@b-fg
Member

b-fg commented Jun 21, 2024

I did some preliminary benchmarks with different mesh sizes N=2^(3*p) using this PR. Overall, the current PR seems a bit slower than master on GPU. The main difference is that the workgroup size is now not specified. Results are below, where the commits (which are wrongly tagged) refer to 33933fd == PR and a8a2506 == master:

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     3028166 │   1.41 │     0.58 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2672719 │   2.11 │     0.55 │     1.05 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2671525 │   1.42 │     0.79 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2339494 │   1.41 │     0.78 │     1.01 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 8
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2085611 │   0.38 │     2.98 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1816307 │   0.25 │     2.79 │     1.07 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 9
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2160883 │   0.08 │    21.20 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1798143 │   0.05 │    19.42 │     1.09 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

@vchuravy
Author

It would be interesting to use CUDA.@profile to see whether the kernel itself slowed down or the "auto-tuning" adds that overhead.
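
The suggested check could look roughly like the sketch below; `sim` is a placeholder for an already-constructed GPU `Simulation`, e.g. the TGV benchmark case.

```julia
using CUDA, WaterLily

sim_step!(sim)                # warm up first so compilation time is excluded
CUDA.@profile sim_step!(sim)  # prints per-kernel timings: a genuinely slower
                              # kernel shows up in the kernel table, while
                              # launch/auto-tuning overhead shows up as host time
```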

@weymouth
Collaborator

On my laptop GPU, I found no regression with this PR. In fact, there is a very small speed-up:

TGV (b01cdce is this PR, 5c78c37 is this PR with 64 workgroup size, f38bea4 is master)

▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166654 │   0.38 │     2.28 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2745665 │   0.58 │     2.87 │     0.80 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2799117 │   0.66 │     2.37 │     0.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2787354 │   0.12 │     7.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2394736 │   0.19 │     7.87 │     0.99 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2442026 │   0.15 │     7.80 │     1.00 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

Jelly

▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2976119 │   0.53 │     1.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2602224 │   0.46 │     2.01 │     0.91 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2652446 │   0.47 │     1.97 │     0.93 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166982 │   0.24 │     5.45 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2747379 │   0.17 │     5.74 │     0.95 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2801011 │   0.15 │     5.75 │     0.95 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

b-fg mentioned this pull request Jul 23, 2024
@b-fg
Member

b-fg commented Aug 1, 2024

I did some more benchmarks after a local merge of master into this PR. All looks good except for removing the workgroup size as we had it before (64). Here 9b6ca77 is this PR merged with master without a workgroup size, and backends is this PR merged with master with workgroup size 64. There is something going on in the CPU backend of KA when the workgroup size is not specified, making it slower than the serial SIMD version. This is with the latest KA version (0.9.22).

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.37 │           395.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.26 │           391.32 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.33 │           394.04 │     1.00 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.31 │           126.28 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     2187731 │   0.00 │    17.90 │           682.65 │     0.58 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.14 │           119.75 │     3.30 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.22 │           122.65 │     3.23 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3465887 │   0.00 │    16.89 │           644.44 │     0.61 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.37 │           128.56 │     3.08 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2619999 │   0.00 │     0.66 │            25.09 │    15.77 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3030802 │   0.00 │     0.65 │            24.62 │    16.07 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671213 │   0.00 │     0.63 │            24.02 │    16.47 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.01 │           276.59 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.53 │           274.34 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    73.38 │           349.91 │     0.79 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    18.52 │            88.29 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1945182 │   0.00 │    66.66 │           317.85 │     0.87 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    18.50 │            88.21 │     3.14 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    20.32 │            96.90 │     2.85 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3082782 │   0.00 │    63.37 │           302.19 │     0.92 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    19.00 │            90.61 │     3.05 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2301906 │   0.00 │     3.11 │            14.82 │    18.66 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2683290 │   0.00 │     3.24 │            15.46 │    17.89 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347006 │   0.00 │     3.06 │            14.59 │    18.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.76 │           591.96 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.74 │           590.32 │     1.00 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.69 │           586.70 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.55 │     4.46 │           340.46 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5341074 │   0.27 │    26.74 │          2040.17 │     0.29 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.54 │     4.52 │           344.74 │     1.72 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   2.39 │     4.92 │           375.75 │     1.58 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     8437098 │   0.21 │    26.96 │          2057.04 │     0.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   0.00 │     4.89 │           372.75 │     1.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416700 │   0.00 │     1.46 │           111.76 │     5.30 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253464 │   0.00 │     1.48 │           112.95 │     5.24 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6542128 │   0.00 │     1.45 │           110.63 │     5.35 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.23 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.88 │           561.49 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.05 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.38 │    20.22 │           192.87 │     2.93 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     6305699 │   0.07 │   121.32 │          1156.96 │     0.49 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.38 │    20.15 │           192.15 │     2.94 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.69 │    21.92 │           209.08 │     2.70 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     9975647 │   0.12 │   121.08 │          1154.70 │     0.49 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.88 │    21.55 │           205.50 │     2.75 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.77 │    12.63 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8713607 │   0.00 │     4.73 │            45.10 │    12.53 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7792785 │   0.00 │     4.70 │            44.78 │    12.62 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@vchuravy
Author

vchuravy commented Aug 1, 2024

So the default workgroupsize for KA is 1024. With 64 you create a lot of small tasks. What is the typical ndrange you use?

@b-fg
Member

b-fg commented Aug 1, 2024

For example, the TGV case is a 3D case for which I tested domain sizes of 64^3 and 128^3. The arrays we use are then (64,64,64) and (64,64,64,3) (and analogously for the 128^3 grid), which is the ndrange we typically pass into the kernel. Also, I am not sure I tested this PR before with multi-threading on the CPU backend... I think it was just on the GPU (as reported previously).

@vchuravy
Author

vchuravy commented Aug 2, 2024

Ah so you are getting perfectly sized blocks, by accident xD

You may want to use (64, 64) instead as the workgroup size.
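
A sketch of that suggestion (placeholder kernel and array, not WaterLily's actual kernels): passing a tuple workgroup size so each workgroup covers a larger contiguous chunk of the array.

```julia
using KernelAbstractions

@kernel function fill_val!(a, val)
    I = @index(Global, Cartesian)
    @inbounds a[I] = val
end

u = zeros(Float32, 64, 64, 64, 3)   # velocity-like array; ndrange = size(u)
backend = CPU()

# A (64, 64) workgroup spans a full 64×64 slab rather than a single
# 64-element column, which suits the CPU's preference for larger base cases.
fill_val!(backend, (64, 64))(u, 1f0, ndrange = size(u))
KernelAbstractions.synchronize(backend)
```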

@b-fg
Member

b-fg commented Aug 4, 2024

Sure, I will do some tests after my summer break. But does this mean that we cannot use the default workgroup size (as in this PR)? Could this be something to improve in KA, where it would try to automatically determine it based on the ndrange?

@vchuravy
Author

vchuravy commented Aug 5, 2024

Yeah, I will need to improve this on the KA side.

@vchuravy
Author

vchuravy commented Aug 7, 2024

I just tagged a new KA version with the fix. This might remove the need for the SIMD variant entirely.

@b-fg
Member

b-fg commented Aug 21, 2024

I have tested the changes and, while the results improve, they are still not there (again, 9b6ca77 is this PR). There might be something else going on, but I am unsure what at the moment...

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.42 │           397.53 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.24 │           390.64 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.29 │           392.71 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.34 │           127.43 │     3.12 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1993389 │   0.00 │     4.25 │           162.06 │     2.45 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.20 │           121.89 │     3.26 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.24 │           123.48 │     3.22 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2647077 │   0.00 │     4.41 │           168.25 │     2.36 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.32 │           126.53 │     3.14 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2621768 │   0.00 │     0.65 │            24.76 │    16.05 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3026963 │   0.00 │     0.63 │            23.91 │    16.62 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671140 │   0.00 │     0.68 │            25.79 │    15.42 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.85 │           280.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.75 │           275.38 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.62 │           274.74 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    20.92 │            99.76 │     2.81 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1819590 │   0.00 │    24.61 │           117.34 │     2.39 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    21.37 │           101.89 │     2.75 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    19.24 │            91.74 │     3.06 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2737306 │   0.00 │    25.67 │           122.38 │     2.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    22.54 │           107.47 │     2.61 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2303706 │   0.00 │     3.09 │            14.71 │    19.07 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2680490 │   0.00 │     3.25 │            15.48 │    18.12 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347008 │   0.00 │     3.16 │            15.05 │    18.65 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.89 │           601.60 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.41 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.59 │     4.54 │           346.53 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     3954012 │   0.00 │     4.35 │           331.67 │     1.81 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.49 │     4.73 │           361.22 │     1.67 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   0.00 │     4.85 │           369.75 │     1.63 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     5237268 │   0.00 │     5.04 │           384.33 │     1.57 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   2.37 │     5.15 │           392.91 │     1.53 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416699 │   0.00 │     1.45 │           110.72 │     5.43 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253470 │   0.00 │     1.47 │           112.14 │     5.36 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6538380 │   0.00 │     1.48 │           112.78 │     5.33 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.22 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.71 │           559.87 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.37 │    20.44 │           194.94 │     2.90 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5129556 │   0.00 │    29.92 │           285.32 │     1.98 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.39 │    21.95 │           209.37 │     2.70 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.81 │    21.45 │           204.58 │     2.76 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     7343892 │   0.00 │    30.12 │           287.21 │     1.97 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.61 │    22.87 │           218.09 │     2.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.69 │    12.65 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8717354 │   0.00 │     4.77 │            45.46 │    12.43 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7787222 │   0.00 │     4.70 │            44.86 │    12.60 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@marinlauber
Member

marinlauber commented Sep 13, 2024

@b-fg Something I picked up today: currently, the main tests will run on CuArray only if you have nvcc installed. If you use the Julia CUDA compiler, it doesn't install nvcc (at least not on my system). The same goes for AMD GPUs, I suppose.

It's kind of related to this PR, I suppose, which is why I added it here.

@b-fg
Member

b-fg commented Sep 13, 2024

Ah, but this is not a problem of this PR, but of WaterLily-Benchmarks, right? If you open an issue there, we can iterate on it.

You mean these test lines, right?

_cuda = check_compiler("nvcc","release")

This is not related to this PR though. The problem is how to automatically detect that CUDA is available without loading CUDA.jl first, and to come up with something that works on all OSes.
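
One possible cross-platform sketch (an illustrative assumption, not the benchmark script's actual logic) is to look for the vendor tools on the PATH instead of parsing compiler output, still without loading CUDA.jl or AMDGPU.jl:

```julia
# Hypothetical helpers; nvidia-smi and rocm-smi ship with the NVIDIA and ROCm
# drivers respectively, so their presence hints at a usable GPU without
# importing any GPU package.
has_nvidia_gpu() = Sys.which("nvidia-smi") !== nothing || Sys.which("nvcc") !== nothing
has_amd_gpu() = Sys.which("rocm-smi") !== nothing

backends = ["Array"]                      # CPU backend is always available
has_nvidia_gpu() && push!(backends, "CuArray")
has_amd_gpu() && push!(backends, "ROCArray")
```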
