
Use preferences to switch between SIMD and KernelAbstractions #133

Draft
vchuravy wants to merge 4 commits into master

Conversation

vchuravy

I was experimenting with using PrecompileTools on WaterLily, and the choice to dispatch to the SIMD backend depending on the nthreads variable caused issues.

  1. In current versions of Julia, nthreads is no longer a constant.
  2. If someone precompiles code, nthreads == 1 during the precompilation process, so the wrong code path gets exercised.

Opening this as a draft for now to solicit feedback. One would probably need to change the tests so that both code paths are tested.
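
As a concrete illustration, a Preferences.jl-based switch could look roughly like the sketch below when placed inside the package module; the preference key "backend" and the `set_backend` helper are illustrative names, not necessarily what this PR actually uses.

```julia
module BackendSelect   # stand-in for the package module (Preferences needs a module with a UUID)

using Preferences

# Read the preference at precompile time, so the chosen code path is baked
# into the precompile cache instead of depending on Threads.nthreads().
const backend = @load_preference("backend", "KernelAbstractions")

"""
    set_backend(new_backend)

Persist the backend choice ("SIMD" or "KernelAbstractions") in
LocalPreferences.toml; takes effect after restarting Julia.
"""
function set_backend(new_backend::String)
    new_backend in ("SIMD", "KernelAbstractions") ||
        throw(ArgumentError("backend must be \"SIMD\" or \"KernelAbstractions\""))
    @set_preferences!("backend" => new_backend)
    @info "Backend set to $new_backend; restart Julia for this to take effect."
end

end # module
```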

@b-fg
Member

b-fg commented Jun 21, 2024

Thanks for catching that. I was aware that nthreads==1 during precompilation was problematic, but during execution it was working as intended. Using Preferences seems like a nice workaround. I will do some tests and integrate it.

Also, not specifying the workgroup size did not yield a noticeable performance increase compared to 64 in the past (iirc). Has something changed in KA related to this? Is it in any case the recommended way to set up kernels?

@vchuravy
Author

Is it in any case the recommended way to set up kernels?

It is a bit tricky between CPU and GPU. Right now the KA backend on the CPU is rather slow since the base-case size is small; the CPU does much better with larger base cases. We don't currently have a way to calculate that base case automatically, so we use 1024 on the CPU as a default.

On the GPU a static base case is nice since it allows some of the index integer operations to be optimized away.
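
For context, here is a small self-contained KernelAbstractions example (not WaterLily code) contrasting a fixed workgroup size with letting KA choose one:

```julia
using KernelAbstractions

@kernel function scale!(a, s)
    I = @index(Global, Cartesian)
    @inbounds a[I] *= s
end

a = rand(Float32, 64, 64, 64)
backend = CPU()               # or CUDABackend(), ROCBackend(), ...

# Fixed workgroup size, as WaterLily has done so far with 64:
scale!(backend, 64)(a, 2f0, ndrange = size(a))

# No workgroup size, as in this PR: KA falls back to its own default
# (the behaviour discussed in this thread):
scale!(backend)(a, 2f0, ndrange = size(a))

KernelAbstractions.synchronize(backend)
```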

@b-fg
Member

b-fg commented Jun 21, 2024

I did some preliminary benchmarks with different mesh sizes N=2^(3*p) using this PR. Overall, the current PR seems a bit slower than master on GPU. The main difference is that the workgroup size is now not specified. Results are below, where the commits (which are wrongly tagged) refer to 33933fd == PR and a8a2506 == master:

Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     3028166 │   1.41 │     0.58 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2672719 │   2.11 │     0.55 │     1.05 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2671525 │   1.42 │     0.79 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     2339494 │   1.41 │     0.78 │     1.01 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 8
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2085611 │   0.38 │     2.98 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1816307 │   0.25 │     2.79 │     1.07 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 9
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   33933fd │ 1.10.2 │   Float32 │     2160883 │   0.08 │    21.20 │     1.00 │
│     GPU │   a8a2506 │ 1.10.2 │   Float32 │     1798143 │   0.05 │    19.42 │     1.09 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

@vchuravy
Author

It would be interesting to use CUDA.@profile to see whether the kernel itself slowed down or the "auto-tuning" adds that overhead.
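
The suggested check could look roughly like the sketch below; `sim` is a placeholder for an already-constructed GPU `Simulation`, e.g. the TGV benchmark case.

```julia
using CUDA, WaterLily

sim_step!(sim)                # warm up first so compilation time is excluded
CUDA.@profile sim_step!(sim)  # prints per-kernel timings: a genuinely slower
                              # kernel shows up in the kernel table, while
                              # launch/auto-tuning overhead shows up as host time
```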

@weymouth
Collaborator

On my laptop GPU, I found no regression with this PR. In fact, there is a very small speed-up:

TGV (b01cdce is this PR, 5c78c37 is this PR with 64 workgroup size, f38bea4 is master)

▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166654 │   0.38 │     2.28 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2745665 │   0.58 │     2.87 │     0.80 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2799117 │   0.66 │     2.37 │     0.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2787354 │   0.12 │     7.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2394736 │   0.19 │     7.87 │     0.99 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2442026 │   0.15 │     7.80 │     1.00 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

Jelly

▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     2976119 │   0.53 │     1.82 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2602224 │   0.46 │     2.01 │     0.91 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2652446 │   0.47 │     1.97 │     0.93 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────┤
│     GPU │   b01cdce │ 1.10.0 │   Float32 │     3166982 │   0.24 │     5.45 │     1.00 │
│     GPU │   5c78c37 │ 1.10.0 │   Float32 │     2747379 │   0.17 │     5.74 │     0.95 │
│     GPU │   f38bea4 │ 1.10.0 │   Float32 │     2801011 │   0.15 │     5.75 │     0.95 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────┘

b-fg mentioned this pull request Jul 23, 2024
@b-fg
Member

b-fg commented Aug 1, 2024

I did some more benchmarks after a local merge of master into this PR. All looks good except for removing the workgroup size as we had it before (64). Here 9b6ca77 is this PR merged with master without a workgroup size, and backends is this PR merged with master with workgroup size 64. There is something going on in the CPU backend of KA when the workgroup size is not specified, making it slower than the serial SIMD version. This is with the latest KA version (0.9.22).

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.37 │           395.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.26 │           391.32 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.33 │           394.04 │     1.00 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.31 │           126.28 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     2187731 │   0.00 │    17.90 │           682.65 │     0.58 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.14 │           119.75 │     3.30 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.22 │           122.65 │     3.23 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3465887 │   0.00 │    16.89 │           644.44 │     0.61 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.37 │           128.56 │     3.08 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2619999 │   0.00 │     0.66 │            25.09 │    15.77 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3030802 │   0.00 │     0.65 │            24.62 │    16.07 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671213 │   0.00 │     0.63 │            24.02 │    16.47 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.01 │           276.59 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.53 │           274.34 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    73.38 │           349.91 │     0.79 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    18.52 │            88.29 │     3.13 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1945182 │   0.00 │    66.66 │           317.85 │     0.87 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    18.50 │            88.21 │     3.14 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    20.32 │            96.90 │     2.85 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     3082782 │   0.00 │    63.37 │           302.19 │     0.92 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    19.00 │            90.61 │     3.05 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2301906 │   0.00 │     3.11 │            14.82 │    18.66 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2683290 │   0.00 │     3.24 │            15.46 │    17.89 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347006 │   0.00 │     3.06 │            14.59 │    18.96 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.76 │           591.96 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.74 │           590.32 │     1.00 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.69 │           586.70 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.55 │     4.46 │           340.46 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5341074 │   0.27 │    26.74 │          2040.17 │     0.29 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.54 │     4.52 │           344.74 │     1.72 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   2.39 │     4.92 │           375.75 │     1.58 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     8437098 │   0.21 │    26.96 │          2057.04 │     0.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   0.00 │     4.89 │           372.75 │     1.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416700 │   0.00 │     1.46 │           111.76 │     5.30 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253464 │   0.00 │     1.48 │           112.95 │     5.24 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6542128 │   0.00 │     1.45 │           110.63 │     5.35 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.23 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.88 │           561.49 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.05 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.38 │    20.22 │           192.87 │     2.93 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     6305699 │   0.07 │   121.32 │          1156.96 │     0.49 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.38 │    20.15 │           192.15 │     2.94 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.69 │    21.92 │           209.08 │     2.70 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     9975647 │   0.12 │   121.08 │          1154.70 │     0.49 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.88 │    21.55 │           205.50 │     2.75 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.77 │    12.63 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8713607 │   0.00 │     4.73 │            45.10 │    12.53 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7792785 │   0.00 │     4.70 │            44.78 │    12.62 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@vchuravy
Author

vchuravy commented Aug 1, 2024

So the default workgroupsize for KA is 1024. With 64 you create a lot of small tasks. What is the typical ndrange you use?

@b-fg
Member

b-fg commented Aug 1, 2024

For example, the TGV case is a 3D case for which I tested domain sizes of 64^3 and 128^3. The arrays we use are then (64,64,64) and (64,64,64,3) (and analogously for the 128^3 grid), which is the ndrange we typically pass into the kernel. Also, I am not sure I tested this PR before with multi-threading on the CPU backend... I think it was just on the GPU (as reported previously).

@vchuravy
Author

vchuravy commented Aug 2, 2024

Ah so you are getting perfectly sized blocks, by accident xD

You may want to use (64, 64) instead as the workgroup size.
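
A sketch of that suggestion (placeholder kernel and array, not WaterLily's actual kernels): passing a tuple workgroup size so each workgroup covers a larger contiguous chunk of the array.

```julia
using KernelAbstractions

@kernel function fill_val!(a, val)
    I = @index(Global, Cartesian)
    @inbounds a[I] = val
end

u = zeros(Float32, 64, 64, 64, 3)   # velocity-like array; ndrange = size(u)
backend = CPU()

# A (64, 64) workgroup spans a full 64×64 slab rather than a single
# 64-element column, which suits the CPU's preference for larger base cases.
fill_val!(backend, (64, 64))(u, 1f0, ndrange = size(u))
KernelAbstractions.synchronize(backend)
```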

@b-fg
Member

b-fg commented Aug 4, 2024

Sure, I will do some tests after my summer break. But does this mean that we cannot use the default workgroup size (as in this PR)? Could this be something to improve in KA, where it would try to automatically determine it based on the ndrange?

@vchuravy
Author

vchuravy commented Aug 5, 2024

Yeah, I will need to improve this on the KA side.

@vchuravy
Author

vchuravy commented Aug 7, 2024

I just tagged a new KA version with the fix. This might remove the need for the SIMD variant entirely.

@b-fg
Member

b-fg commented Aug 21, 2024

I have tested the changes and, while the results improve, they are still not there (again, 9b6ca77 is this PR). There might be something else going on, but I am unsure what at the moment...

Benchmarks
Benchmark environment: tgv sim_step! (max_steps=100)
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.42 │           397.53 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.24 │           390.64 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       78733 │   0.00 │    10.29 │           392.71 │     1.01 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     2223302 │   0.00 │     3.34 │           127.43 │     3.12 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1993389 │   0.00 │     4.25 │           162.06 │     2.45 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2274514 │   0.00 │     3.20 │           121.89 │     3.26 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3503858 │   0.00 │     3.24 │           123.48 │     3.22 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2647077 │   0.00 │     4.41 │           168.25 │     2.36 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3555070 │   0.00 │     3.32 │           126.53 │     3.14 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2621768 │   0.00 │     0.65 │            24.76 │    16.05 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     3026963 │   0.00 │     0.63 │            23.91 │    16.62 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2671140 │   0.00 │     0.68 │            25.79 │    15.42 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 7
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │       70606 │   0.00 │    58.85 │           280.64 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.75 │           275.38 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │       70606 │   0.00 │    57.62 │           274.74 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     1976782 │   0.00 │    20.92 │            99.76 │     2.81 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     1819590 │   0.00 │    24.61 │           117.34 │     2.39 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     2021882 │   0.00 │    21.37 │           101.89 │     2.75 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     3114382 │   0.00 │    19.24 │            91.74 │     3.06 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     2737306 │   0.00 │    25.67 │           122.38 │     2.29 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     3159482 │   0.00 │    22.54 │           107.47 │     2.61 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     2303706 │   0.00 │     3.09 │            14.71 │    19.07 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     2680490 │   0.00 │     3.25 │            15.48 │    18.12 │
│    CUDA │    master │ 1.10.4 │   Float32 │     2347008 │   0.00 │     3.16 │            15.05 │    18.65 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
Benchmark environment: jelly sim_step! (max_steps=100)
▶ log2p = 5
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.89 │           601.60 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.41 │     1.02 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      196756 │   0.00 │     7.71 │           588.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     5428158 │   1.59 │     4.54 │           346.53 │     1.74 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     3954012 │   0.00 │     4.35 │           331.67 │     1.81 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     5549932 │   1.49 │     4.73 │           361.22 │     1.67 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │     8524182 │   0.00 │     4.85 │           369.75 │     1.63 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     5237268 │   0.00 │     5.04 │           384.33 │     1.57 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │     8645956 │   2.37 │     5.15 │           392.91 │     1.53 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     6416699 │   0.00 │     1.45 │           110.72 │     5.43 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     7253470 │   0.00 │     1.47 │           112.14 │     5.36 │
│    CUDA │    master │ 1.10.4 │   Float32 │     6538380 │   0.00 │     1.48 │           112.78 │     5.33 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘
▶ log2p = 6
┌─────────┬───────────┬────────┬───────────┬─────────────┬────────┬──────────┬──────────────────┬──────────┐
│ Backend │ WaterLily │ Julia  │ Precision │ Allocations │ GC [%] │ Time [s] │ Cost [ns/DOF/dt] │ Speed-up │
├─────────┼───────────┼────────┼───────────┼─────────────┼────────┼──────────┼──────────────────┼──────────┤
│  CPUx01 │  backends │ 1.10.4 │   Float32 │      230016 │   0.00 │    59.27 │           565.22 │     1.00 │
│  CPUx01 │   9b6ca77 │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.71 │           559.87 │     1.01 │
│  CPUx01 │    master │ 1.10.4 │   Float32 │      230016 │   0.00 │    58.31 │           556.06 │     1.02 │
│  CPUx04 │  backends │ 1.10.4 │   Float32 │     6408293 │   0.37 │    20.44 │           194.94 │     2.90 │
│  CPUx04 │   9b6ca77 │ 1.10.4 │   Float32 │     5129556 │   0.00 │    29.92 │           285.32 │     1.98 │
│  CPUx04 │    master │ 1.10.4 │   Float32 │     6552597 │   0.39 │    21.95 │           209.37 │     2.70 │
│  CPUx08 │  backends │ 1.10.4 │   Float32 │    10078241 │   0.81 │    21.45 │           204.58 │     2.76 │
│  CPUx08 │   9b6ca77 │ 1.10.4 │   Float32 │     7343892 │   0.00 │    30.12 │           287.21 │     1.97 │
│  CPUx08 │    master │ 1.10.4 │   Float32 │    10222545 │   0.61 │    22.87 │           218.09 │     2.59 │
│    CUDA │  backends │ 1.10.4 │   Float32 │     7642918 │   0.00 │     4.69 │            44.69 │    12.65 │
│    CUDA │   9b6ca77 │ 1.10.4 │   Float32 │     8717354 │   0.00 │     4.77 │            45.46 │    12.43 │
│    CUDA │    master │ 1.10.4 │   Float32 │     7787222 │   0.00 │     4.70 │            44.86 │    12.60 │
└─────────┴───────────┴────────┴───────────┴─────────────┴────────┴──────────┴──────────────────┴──────────┘

@marinlauber
Member

marinlauber commented Sep 13, 2024

@b-fg Something I picked up today: currently, the main tests will run on CuArray only if you have nvcc installed. If you use the Julia CUDA compiler, it doesn't install nvcc (at least not on my system). The same goes for AMD GPUs, I suppose.

It's kind of related to this PR, I suppose, which is why I added it here.

@b-fg
Member

b-fg commented Sep 13, 2024

Ah, but this is not a problem of this PR, but of WaterLily-Benchmarks, right? If you open an issue there, we can iterate on it.

You mean these test lines, right?

_cuda = check_compiler("nvcc","release")

This is not related to this PR though. The problem is how to automatically detect that CUDA is available without loading CUDA.jl first, and to come up with something that works on all OSes.
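
One possible cross-platform sketch (an illustrative assumption, not the benchmark script's actual logic) is to look for the vendor tools on the PATH instead of parsing compiler output, still without loading CUDA.jl or AMDGPU.jl:

```julia
# Hypothetical helpers; nvidia-smi and rocm-smi ship with the NVIDIA and ROCm
# drivers respectively, so their presence hints at a usable GPU without
# importing any GPU package.
has_nvidia_gpu() = Sys.which("nvidia-smi") !== nothing || Sys.which("nvcc") !== nothing
has_amd_gpu() = Sys.which("rocm-smi") !== nothing

backends = ["Array"]                      # CPU backend is always available
has_nvidia_gpu() && push!(backends, "CuArray")
has_amd_gpu() && push!(backends, "ROCArray")
```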
