Faster, cleaner pressure solvers for all topologies on CPUs and GPUs #1338
Conversation
I think this makes sense given my primitive understanding of how FFTW picks optimal plans for the particular problem it's asked to solve.
Tests all pass but the code is a bit messy (especially ...).

Ran new FFT-based Poisson solver benchmarks on Tartarus (Titan V GPUs) and static ocean benchmarks for all topologies on Satori (Tesla V100 GPUs), plus regression tests. Results are below. Will post a follow-up with some highlights/conclusions.

FFT-based Poisson solver benchmarks: raw numbers, CPU to GPU speedup, CPU slowdown (vs. triply-periodic), and GPU slowdown (vs. triply-periodic). [benchmark tables omitted]

Static ocean benchmarks for all topologies: raw numbers, CPU to GPU speedup, CPU slowdown (vs. triply-periodic), and GPU slowdown (vs. triply-periodic). [benchmark tables omitted]

Performance vs. main branch: main branch vs. this PR/branch. [benchmark tables omitted]
Some highlights/conclusions:
Very relevant for #1085.
You got it.
src/Solvers/discrete_transforms.jl
Outdated
```julia
dims :: Δ
topology :: Ω
normalization :: N
twiddle :: T
```
this is a technical name
True. I'll rename it to `twiddle_factors`, which is still pretty technical but has the benefit of having a Wikipedia page: https://en.wikipedia.org/wiki/Twiddle_factor
Wow, I was joking, but it is a technical name. Could write:

```julia
twiddle_factors :: T  # https://en.wikipedia.org/wiki/Twiddle_factor
```
```julia
periodic_dims = findall(t -> t == Periodic, topo)
bounded_dims = findall(t -> t == Bounded, topo)

if arch isa GPU && topo in non_batched_topologies
```
😱
```julia
)
end

else
```
Is this the case for `(Periodic, Periodic, Periodic)`?
This is the case where batching transforms is possible. It's always possible on the CPU since FFTW is awesome, so it includes all topologies on the CPU.

On the GPU, batching is possible when the topology is not one of `non_batched_topologies` (where an FFT is needed along dimension 2), so it includes `(Periodic, Periodic, Periodic)`, `(Periodic, Periodic, Bounded)`, and `(Bounded, Periodic, Periodic)`.
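A minimal sketch of that check, following the discussion above (the exact contents of `non_batched_topologies` here are inferred from the three batchable GPU cases just named, not copied from the PR):

```julia
using Oceananigans  # assumed to provide the CPU, GPU, Periodic, and Bounded types

# Topologies that require a standalone transform along dimension 2, which
# CUFFT cannot batch. Illustrative: the five topologies other than the
# three batchable GPU cases named above.
non_batched_topologies = [
    (Periodic, Bounded, Periodic),
    (Periodic, Bounded, Bounded),
    (Bounded, Periodic, Bounded),
    (Bounded, Bounded, Periodic),
    (Bounded, Bounded, Bounded),
]

# FFTW batches transforms for every topology on the CPU; on the GPU we can
# only batch when no transform is needed along dimension 2.
batchable(arch, topo) = arch isa CPU || !(topo in non_batched_topologies)
```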
Future generations may thank you if you put this comment in the code
@@ -0,0 +1,23 @@

```julia
function solve_poisson_equation!(solver)
```
Maybe a docstring that says something like "we solve ∇²ϕ = RHS"?
```julia
# Setting DC component of the solution (the mean) to be zero. This is also
# necessary because the source term to the Poisson equation has zero mean
# and so the DC component comes out to be ∞.
CUDA.@allowscalar ϕ[1, 1, 1] = 0
```
Isn't the problem that `λx[1, 1, 1] + λy[1, 1, 1] + λz[1, 1, 1] = 0`? If `RHS[1, 1, 1] = 0` we get `NaN`, otherwise we'd get `Inf` (and we can't have either). I'm not 100% sure what DC refers to, but I think it's sufficient to mention that, in eigenspace, `ϕ[1, 1, 1]` is the "zeroth mode" corresponding to the volume mean of the transform of ϕ, or of ϕ in physical space.
Yeah exactly, so you get one NaN which then turns the entire array into NaNs once you apply an inverse transform.

Ah sorry, DC component = direct current component lol. I guess it refers to the fact that there's a non-zero mean.

The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant (the global mean of the solution), so we need to pick a constant. `ϕ[1, 1, 1] = 0` chooses the constant to be zero so that the solution has zero mean.

I guess we always take gradients of the pressure, so the constant theoretically shouldn't matter.

I'll improve the comment and add your suggestion.
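A minimal sketch of the eigenspace step being discussed, with hypothetical names `λx`, `λy`, `λz`, and `RHS` following the conversation (the real solver's array layout is assumed):

```julia
using CUDA

# In eigenspace the Poisson equation ∇²ϕ = RHS becomes an elementwise
# division by the sum of the eigenvalues along each dimension:
@. ϕ = RHS / (λx + λy + λz)

# λx[1,1,1] + λy[1,1,1] + λz[1,1,1] == 0, so ϕ[1,1,1] comes out as NaN
# (0/0) or Inf, and either would poison the whole array after the inverse
# transform. Setting it to zero picks the unique zero-mean solution.
CUDA.@allowscalar ϕ[1, 1, 1] = 0
```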
Thanks for explaining; I thought DC was discrete cosine, but since we are talking about the mean, I guess that would be the first component of the discrete cosine transform?
I think it's there whether we're doing FFTs for `Periodic` or DCTs for `Bounded`, but yeah, I think in both cases it's the first (zeroth?) component of the transform.
> The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant
I agree with this! We also know that adding a constant to pressure leaves the problem unchanged (at least for the equation of state we use...)
src/Solvers/solve_for_pressure.jl
Outdated
""" | ||
@kernel function copy_pressure!(p, ϕ, solver_type, arch, grid) | ||
@kernel function copy_pressure!(p, ϕ, arch, grid::AbstractGrid{FT, TX, TY, TZ}) where {FT, TX, TY, TZ} |
Why do we need `FT, TX, TY, TZ`? Maybe I'm missing something.
Yeah not sure why that's there. Will remove.
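A sketch of the simplified signature (the kernel body here is a plausible guess, not the PR's exact code):

```julia
using KernelAbstractions

@kernel function copy_pressure!(p, ϕ, arch, grid)
    i, j, k = @index(Global, NTuple)
    # Copy the (real part of the) transformed solution into the pressure field.
    @inbounds p[i, j, k] = real(ϕ[i, j, k])
end
```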
Very nice work @ali-ramadhan! Definitely a new release after this.
This PR is still a work in progress, but I'm opening it to make the future design of the pressure solver module more transparent, as we will be adding some new pressure solvers soon, including a conjugate-gradient solver by @christophernhill.
Motivation
In PR #290 I implemented a pressure solver for the `(Periodic, Bounded, Bounded)` channel topology using the 2D fast cosine transform algorithm described by Makhoul (1982), as CUFFT does not provide cosine transforms for the GPU and does not support FFTs along non-batched dimensions (see JuliaGPU/CUDA.jl#119). This has been a very unpopular decision, for good reasons: the 2D DCT algorithm is quite slow (channels are ~2x slower than doubly periodic on GPUs) and quite complicated. Due to my inexperience, I didn't realize that transposing the array to do the FFT was the way forward.
The pressure solver module is also quite out of date: it hasn't been updated since topologies were introduced (#614) almost exactly a year ago.
This PR refactors the pressure solver module to:
Resolves #586
Resolves #593
Resolves #594
Resolves #1007
To batch or not to batch for FFTW on CPUs?
TODO:
To see whether we should just do 1D transforms for everything or whether batching is faster, I ran some 1D and 3D FFT benchmarks. The results for the triply-periodic case are posted below.

Based on the benchmarks, it seems that for 256^3 arrays, doing three 1D transforms is ~15% slower than doing one 3D transform. So it makes sense to batch transforms when possible.

Note that the FFT along dimension 1 is the fastest, the FFT along dimension 2 is the slowest, and the FFT along dimension 3 is in the middle. So whatever FFTW is doing under the hood, FFTs along non-batched dimensions (e.g. along dimension 2) are slow. A sketch of this comparison follows.
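A minimal sketch of the kind of benchmark being described (the array size and element type are assumptions):

```julia
using FFTW, BenchmarkTools

N = 256
A = rand(ComplexF64, N, N, N)

# One batched 3D transform versus three 1D transforms, one per dimension.
plan3d = plan_fft(A)
plan1, plan2, plan3 = plan_fft(A, 1), plan_fft(A, 2), plan_fft(A, 3)

@btime $plan3d * $A                       # single 3D FFT
@btime $plan3 * ($plan2 * ($plan1 * $A))  # three 1D FFTs, ~15% slower
```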
To batch or not to batch for CUFFT on GPUs?
We should investigate this separately for CUFFT since an FFT along dimension 2 requires a transpose.
TODO:
The same benchmarks for the GPU are posted below. Batching is much faster (by a factor of 2-3), so we should batch when possible.

Note that FFTs along non-batched dimensions (dimension 2 in this case) are much slower since they involve two transpose operations, as sketched below.
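A minimal sketch of why a dimension-2 FFT costs two transposes on the GPU (an illustration of the idea, not the PR's implementation):

```julia
using CUDA, FFTW  # loading CUDA registers CUFFT methods for CuArrays

A = CuArray(rand(ComplexF32, 256, 256, 256))

# CUFFT batches efficiently along dimension 1, so a transform along
# dimension 2 can be emulated by permuting that dimension to the front,
# transforming, and permuting back: two transposes per FFT.
B = permutedims(A, (2, 1, 3))   # transpose 1: bring dimension 2 forward
fft!(B, 1)                      # batched FFT along the new first dimension
A .= permutedims(B, (2, 1, 3))  # transpose 2: restore the original layout
```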
Batching will not be possible for some topologies, in which case we'll take a performance hit. But if the pressure solver is still only ~10-15% of the total runtime, then a 2x hit on the pressure solver is not that large. The hit will mostly affect topologies we don't currently support anyway.