From 84a66d29c5fd26628ccdc954fce36831a3f7d336 Mon Sep 17 00:00:00 2001
From: Gabriel Weymouth
Date: Fri, 2 Aug 2024 13:15:51 +0200
Subject: [PATCH] Update README.md

---
 README.md | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index f5e16b7..aa8d42f 100644
--- a/README.md
+++ b/README.md
@@ -52,7 +52,7 @@ The circle geometry is defined using a [signed distance function](https://en.wik
 The code block above returns a `Simulation` with the parameters we've defined. Now we can initialize a simulation (first line) and step it forward in time (second line)
 ```julia
-circ = circle(3*2^6,2^7)
+circ = circle(3*2^5,2^6)
 sim_step!(circ)
 ```
 Note we've set `n,m` to be multiples of powers of 2, which is important when using the (very fast) geometric multi-grid solver.
@@ -110,23 +110,31 @@ WaterLily uses [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstrac
 Note that multi-threading requires _starting_ Julia with the `--threads` argument, see [the multi-threading section](https://docs.julialang.org/en/v1/manual/multi-threading/) of the manual. If you are running Julia with multiple threads, KernelAbstractions will detect this and multi-thread the loops automatically.
 
-Running on a GPU requires initializing the `Simulation` memory on the GPU, and care needs to be taken to move the data back to the CPU for visualization. For example, the cylinder example above could be run `using CUDA` by
+Running on a GPU requires initializing the `Simulation` memory on the GPU, and care needs to be taken to move the data back to the CPU for visualization. As an example, let's compare a 3D GPU simulation of a sphere to the 2D multi-threaded CPU circle defined above:
 ```Julia
-function circle(n,m;Re=100,U=1,T=Float64,mem=Array)
+using CUDA,WaterLily
+function sphere(n,m;Re=100,U=1,T=Float64,mem=Array)
     radius, center = m/8, m/2-1
     body = AutoBody((x,t)->√sum(abs2, x .- center) - radius)
-    Simulation((n,m),(U,0),2radius;ν=U*2radius/Re,body,
+    Simulation((n,m,m),(U,0,0),2radius; # 3D array size and BCs
               mem, # memory type
-              T) # Floating point type
+              T, # Floating point type
+              ν=U*2radius/Re,body)
 end
-using CUDA
-@assert CUDA.functional() # is your CUDA GPU working??
-circGPU = circle(3*2^7,2^8;T=Float32,mem=CuArray); # GPU like big
-sim_step!(circGPU,100,remeasure=false) # still much faster
-contourf(Array(circGPU.flow.u[:,:,1])') # copy to CPU before plotting
+@assert CUDA.functional() # is your CUDA GPU working??
+GPUsim = sphere(3*2^5,2^6;T=Float32,mem=CuArray); # 3D GPU sim!
+println(length(GPUsim.flow.u)) # ~1.3M degrees of freedom!
+sim_step!(GPUsim) # compile GPU code & run one step
+@time sim_step!(GPUsim,50,remeasure=false) # 40s!!
+
+CPUsim = circle(3*2^5,2^6); # 2D CPU sim
+println(length(CPUsim.flow.u)) # ~0.013M degrees of freedom!
+sim_step!(CPUsim) # compile CPU code & run one step
+println(Threads.nthreads()) # I'm using 8 threads
+@time sim_step!(CPUsim,50,remeasure=false) # 28s!!
 ```
-See the [2024 paper](https://physics.paperswithcode.com/paper/waterlily-jl-a-differentiable-and-backend) and the [examples repo](https://github.com/WaterLily-jl/WaterLily-Examples) for many more non-trivial examples including running on AMD GPUs.
+As you can see, the 3D sphere set-up is almost identical to the 2D circle, but using 3D arrays means there are almost 1.3M degrees of freedom, 100x bigger than in 2D. Nevertheless, the simulation is quite fast on the GPU, only around 40% slower than the much smaller 2D simulation on a CPU with 8 threads. See the [2024 paper](https://physics.paperswithcode.com/paper/waterlily-jl-a-differentiable-and-backend) and the [examples repo](https://github.com/WaterLily-jl/WaterLily-Examples) for many more non-trivial examples, including running on AMD GPUs.
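+For example, AMD cards should only require swapping the array type passed to `mem`. The snippet below is a rough, untested sketch: it assumes AMDGPU.jl is installed and working, that its `ROCArray` type can be passed to `mem` just like `CuArray` above, and it reuses the `sphere` function we just defined.
+```julia
+using AMDGPU                                       # assumes AMDGPU.jl is installed; untested sketch
+@assert AMDGPU.functional()                        # is your AMD GPU working?
+AMDsim = sphere(3*2^5,2^6;T=Float32,mem=ROCArray); # same set-up, AMD array type
+sim_step!(AMDsim,50,remeasure=false)               # time-step on the AMD GPU
+```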
+Finally, KernelAbstractions does incur some CPU allocations for every loop, but other than this `sim_step!` is completely non-allocating. This is one reason why the speed-up improves as the size of the simulation increases.
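+When it comes to visualization, remember that the simulation arrays above live in GPU memory, so they need to be copied back to the CPU before plotting. Here is a minimal sketch (assuming `Plots` is loaded; the mid-plane slice is just one choice for viewing the 3D field):
+```julia
+using Plots                       # assumes Plots.jl is installed
+u = Array(GPUsim.flow.u)          # copy the velocity field from the GPU to the CPU
+contourf(u[:,:,size(u,3)÷2,1]')   # x-velocity on the mid-plane z-slice
+```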