
Faster, cleaner pressure solvers for all topologies on CPUs and GPUs #1338

Merged · 40 commits · Feb 8, 2021

Conversation

ali-ramadhan
Member

@ali-ramadhan ali-ramadhan commented Feb 3, 2021

This PR is still a work in progress, but I'm opening it to make the future design of the pressure solver module more transparent, as we will be adding some new pressure solvers soon, including a conjugate-gradient solver by @christophernhill.

Motivation

In PR #290 I implemented a pressure solver for the (Periodic, Bounded, Bounded) channel topology using the 2D fast cosine transform algorithm described by Makhoul (1982), as CUFFT does not provide cosine transforms for the GPU and does not support FFTs along non-batched dimensions (see JuliaGPU/CUDA.jl#119).

This has been an unpopular decision, for good reasons: the 2D DCT algorithm is quite slow (channels are ~2x slower than doubly-periodic models on GPUs) and quite complicated. Due to my inexperience, I didn't realize at the time that transposing the array to do the FFT was the way forward.
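For context, the 1D version of the idea is a reordering of the input, a single FFT, and a twiddle-factor correction; here is a minimal sketch (my own illustration, even N assumed, unnormalized DCT-II), while the 2D algorithm the old solver used is more involved:

using FFTW

function dct_via_fft(x)
    N = length(x)
    v = similar(x)
    v[1:N÷2]      .= x[1:2:N]   # even-indexed samples (0-based), in order
    v[N:-1:N÷2+1] .= x[2:2:N]   # odd-indexed samples, reversed
    V = fft(v)
    k = 0:N-1
    return 2 .* real.(exp.(-im * π .* k ./ (2N)) .* V)   # twiddle-factor correction
end

x = rand(8)
dct_via_fft(x) ≈ FFTW.r2r(x, FFTW.REDFT10)   # matches the unnormalized DCT-II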

The pressure solver module is also quite out of date: it hasn't been updated since topologies were introduced (#614) almost exactly a year ago.

This PR refactors the pressure solver module to:

  1. Support all topologies on the CPU and GPU, performing transposes and index permutations as needed by each transform.
  2. Use the fastest transforms allowed by the topology. This means batching dimensions when possible.
  3. Consolidate all pressure solvers into a single solver for all topologies. This should simplify the code and make it easier to extend.

Resolves #586
Resolves #593
Resolves #594
Resolves #1007

To batch or not to batch for FFTW on CPUs?

TODO:

  • Benchmark 1D {FFT, IFFT}{x, y, z}.
  • Benchmark 3D {FFT, IFFT}
  • Benchmark 1D {DCT, IDCT}{x, y, z}.
  • Benchmark 3D {DCT, IDCT}
  • Try N = 16, 64, 256
  • Is it faster to do 3 1D transforms or 1 3D transform? Answer: 1 3D transform.

To see whether we should just do 1D transforms for everything or whether batching is faster, I ran some 1D and 3D FFT benchmarks. The results for the triply-periodic topology are posted below.

Based on the benchmarks, it seems that for 256^3 doing three 1D transforms is ~15% slower than doing one 3D transform. So it makes sense to batch transforms when possible.

Note that the FFT along dimension 1 is the fastest and the FFT along dimension 2 is the slowest, with dimension 3 in between. So whatever FFTW is doing under the hood, FFTs along non-batched dimensions (e.g. along dimension 2) are slow.
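For reference, the comparison boils down to something like the sketch below (illustrative only, not the actual benchmark script):

using FFTW, BenchmarkTools

N = 256
A = rand(ComplexF64, N, N, N)

# One in-place plan covering all three dimensions vs. three separate 1D plans.
p3d = plan_fft!(A)
p1d = [plan_fft!(A, d) for d in 1:3]

@btime $p3d * $A                                     # 1 × 3D FFT (in place)
@btime $(p1d[3]) * ($(p1d[2]) * ($(p1d[1]) * $A))    # 3 × 1D FFTs (in place)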

                                               FFT benchmarks
┌───────────────┬─────┬───────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │      dims │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼───────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           CPU │  16 │ (1, 2, 3) │  13.948 μs │  14.043 μs │  20.717 μs │  80.605 μs │   0 bytes │      0 │
│           CPU │  64 │ (1, 2, 3) │   1.656 ms │   1.717 ms │   1.809 ms │   2.697 ms │   0 bytes │      0 │
│           CPU │ 256 │ (1, 2, 3) │ 229.619 ms │ 233.008 ms │ 234.033 ms │ 243.288 ms │   0 bytes │      0 │
│           CPU │  16 │         1 │   3.240 μs │   3.255 μs │   3.603 μs │   6.746 μs │   0 bytes │      0 │
│           CPU │  64 │         1 │ 445.803 μs │ 458.928 μs │ 513.041 μs │ 755.937 μs │   0 bytes │      0 │
│           CPU │ 256 │         1 │  61.083 ms │  63.464 ms │  63.969 ms │  67.009 ms │   0 bytes │      0 │
│           CPU │  16 │         2 │   4.085 μs │   4.135 μs │   4.723 μs │   8.088 μs │   0 bytes │      0 │
│           CPU │  64 │         2 │ 564.769 μs │ 579.278 μs │ 615.731 μs │ 804.859 μs │   0 bytes │      0 │
│           CPU │ 256 │         2 │ 110.718 ms │ 111.560 ms │ 111.506 ms │ 112.525 ms │   0 bytes │      0 │
│           CPU │  16 │         3 │   7.772 μs │   7.787 μs │   9.499 μs │  24.886 μs │   0 bytes │      0 │
│           CPU │  64 │         3 │ 684.541 μs │ 688.275 μs │ 811.874 μs │   1.463 ms │   0 bytes │      0 │
│           CPU │ 256 │         3 │  93.902 ms │  94.489 ms │  94.604 ms │  95.639 ms │   0 bytes │      0 │
└───────────────┴─────┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

3D FFT --> 3 × 1D FFTs slowdown:
CPU,  16: 1.0807x
CPU,  64: 1.0053x
CPU, 256: 1.1567x

To batch or not to batch for CUFFT on GPUs?

We should investigate this separately for CUFFT since FFT along dimension 2 requires a transpose.

TODO:

  • Figure out how to do an FFT_y on the GPU!
  • Implement and benchmark doing it the distributed way.
  • Benchmark 1 3D FFT with 3 1D FFTs.
  • Benchmark 1 3D DCT with 3 1D DCTs.

The same benchmarks for the GPU are posted below. Batching is much faster (by a factor of 2-3), so we should batch when possible.

Note that FFTs along non-batched dimensions (dimension 2 in this case) are much slower since they involve two transpose operations.
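Conceptually, a transform along dimension 2 looks something like this sketch (illustrative only, not the actual implementation):

using CUDA
using CUDA.CUFFT

A = CuArray(rand(ComplexF32, 64, 64, 64))

At = permutedims(A, (2, 1, 3))   # transpose so dimension 2 becomes dimension 1
fft!(At, 1)                      # now a fast, batched FFT along the first dimension
A .= permutedims(At, (2, 1, 3))  # transpose back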

Batching will not be possible for some topologies, in which case we'll take a performance hit. But if the pressure solver still only accounts for ~10-15% of the runtime, then a 2x hit on the pressure solver is not that large. The hit will mostly affect topologies we don't currently support anyway.

                                               FFT benchmarks
┌───────────────┬─────┬───────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │      dims │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼───────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           GPU │  16 │ (1, 2, 3) │  25.478 μs │  32.459 μs │ 122.062 μs │ 703.376 μs │ 224 bytes │     13 │
│           GPU │  64 │ (1, 2, 3) │  67.226 μs │  71.497 μs │ 146.042 μs │ 647.734 μs │ 224 bytes │     13 │
│           GPU │ 256 │ (1, 2, 3) │   2.982 ms │   3.041 ms │   3.036 ms │   3.116 ms │ 224 bytes │     13 │
│           GPU │  16 │         1 │  14.755 μs │  30.020 μs │ 107.932 μs │ 677.045 μs │  96 bytes │      5 │
│           GPU │  64 │         1 │  26.521 μs │  41.294 μs │ 114.587 μs │ 674.834 μs │  96 bytes │      5 │
│           GPU │ 256 │         1 │ 930.371 μs │ 936.222 μs │ 954.771 μs │   1.060 ms │  96 bytes │      5 │
│           GPU │  16 │         2 │  26.547 μs │  49.440 μs │ 127.426 μs │ 768.771 μs │  1.41 KiB │     59 │
│           GPU │  64 │         2 │ 116.160 μs │ 117.772 μs │ 193.909 μs │ 797.293 μs │  1.41 KiB │     59 │
│           GPU │ 256 │         2 │   4.963 ms │   5.010 ms │   5.014 ms │   5.073 ms │  1.41 KiB │     59 │
│           GPU │  16 │         3 │  14.918 μs │  22.509 μs │  40.029 μs │ 110.119 μs │ 224 bytes │     13 │
│           GPU │  64 │         3 │  40.151 μs │  45.495 μs │ 124.422 μs │ 646.093 μs │ 224 bytes │     13 │
│           GPU │ 256 │         3 │   1.062 ms │   1.067 ms │   1.101 ms │   1.292 ms │ 224 bytes │     13 │
└───────────────┴─────┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

3D FFT --> 3 × 1D FFTs slowdown:
GPU,  16: 3.1414x
GPU,  64: 2.8611x
GPU, 256: 2.3062x

@ali-ramadhan ali-ramadhan marked this pull request as draft February 3, 2021 18:38
@glwagner
Member

glwagner commented Feb 3, 2021

Based on the benchmarks, it seems that for 256^3 doing three 1D transforms is ~15% slower than doing one 3D transform. So it makes sense to batch transforms when possible.

I think this makes sense given my primitive understanding of how FFTW picks optimal plans for the particular problem it's asked to solve.
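For what it's worth, you can see what FFTW decides for a given problem by displaying the plan (illustrative only; FFTW.MEASURE makes it actually time candidate algorithms):

using FFTW

A = rand(ComplexF64, 256, 256, 256)

p3d = plan_fft(A; flags=FFTW.MEASURE)      # one plan over dims (1, 2, 3)
p_y = plan_fft(A, 2; flags=FFTW.MEASURE)   # plan along dimension 2 only

p3d   # in the REPL, displaying the plan prints the algorithm FFTW chose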

@ali-ramadhan
Member Author

Tests all pass, but the code is a bit messy (especially plan_transforms.jl) due to lots of special cases that I'm not yet sure how to simplify, so it needs some work.

I ran the new FFT-based Poisson solver benchmarks on Tartarus (Titan V GPUs) and static ocean benchmarks for all topologies on Satori (Tesla V100 GPUs), plus a performance comparison against the main branch. Results are below; I'll post a follow-up with some highlights/conclusions.

FFT-based Poisson solver benchmarks

Raw numbers

                                               FFT-based Poisson solver benchmarks
┌───────────────┬─────┬────────────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │                     Topologies │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │ 560.466 ms │ 563.145 ms │ 563.074 ms │ 566.700 ms │ 192 bytes │      4 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │ 434.408 ms │ 435.974 ms │ 437.003 ms │ 441.246 ms │ 160 bytes │      2 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │ 472.312 ms │ 473.340 ms │ 473.620 ms │ 475.649 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │ 333.460 ms │ 334.702 ms │ 334.998 ms │ 336.918 ms │ 160 bytes │      2 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │ 495.012 ms │ 497.853 ms │ 497.462 ms │ 500.181 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │ 363.169 ms │ 365.104 ms │ 365.891 ms │ 373.893 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │ 349.305 ms │ 350.431 ms │ 352.641 ms │ 371.861 ms │ 160 bytes │      2 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │ 203.109 ms │ 203.653 ms │ 204.025 ms │ 206.834 ms │ 192 bytes │      4 │
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │   7.765 ms │  16.841 ms │  15.934 ms │  16.872 ms │ 84.00 KiB │    904 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │   6.492 ms │  13.599 ms │  12.878 ms │  13.633 ms │ 57.50 KiB │    651 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │   6.432 ms │  13.616 ms │  12.883 ms │  13.640 ms │ 57.31 KiB │    645 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │   9.430 ms │  19.467 ms │  18.452 ms │  19.582 ms │ 27.84 KiB │    294 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │   6.330 ms │  13.532 ms │  12.824 ms │  13.642 ms │ 57.50 KiB │    651 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │   4.882 ms │  10.317 ms │   9.772 ms │  10.332 ms │ 30.63 KiB │    386 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │   3.424 ms │   7.083 ms │   6.713 ms │   7.175 ms │ 27.84 KiB │    294 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │   1.843 ms │   3.482 ms │   3.318 ms │   3.489 ms │  1.09 KiB │     31 │
└───────────────┴─────┴────────────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

CPU to GPU speedup

             FFT-based Poisson solver CPU -> GPU speedup
┌─────┬────────────────────────────────┬─────────┬─────────┬────────┐
│  Ns │                     Topologies │ speedup │  memory │ allocs │
├─────┼────────────────────────────────┼─────────┼─────────┼────────┤
│ 192 │    (Bounded, Bounded, Bounded) │ 33.4393 │   448.0 │  226.0 │
│ 192 │   (Bounded, Bounded, Periodic) │ 32.0602 │   368.0 │  325.5 │
│ 192 │   (Bounded, Periodic, Bounded) │ 34.7631 │   366.8 │  322.5 │
│ 192 │  (Bounded, Periodic, Periodic) │ 17.1932 │   178.2 │  147.0 │
│ 192 │   (Periodic, Bounded, Bounded) │ 36.7915 │   368.0 │  325.5 │
│ 192 │  (Periodic, Bounded, Periodic) │ 35.3884 │   196.0 │  193.0 │
│ 192 │  (Periodic, Periodic, Bounded) │ 49.4769 │   178.2 │  147.0 │
│ 192 │ (Periodic, Periodic, Periodic) │ 58.4816 │ 5.83333 │   7.75 │
└─────┴────────────────────────────────┴─────────┴─────────┴────────┘

CPU slowdown (vs. triply-periodic)

                  FFT-based Poisson solver relative performance (CPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬──────────┬────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │   memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼──────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │  2.76522 │      1.0 │    1.0 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │  2.14077 │ 0.833333 │    0.5 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │  2.32425 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │  1.64349 │ 0.833333 │    0.5 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │  2.44462 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │  1.79278 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │  1.72073 │ 0.833333 │    0.5 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │      1.0 │    1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴──────────┴────────┘

GPU slowdown (vs. triply-periodic)

                  FFT-based Poisson solver relative performance (GPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼─────────┼─────────┤
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │  4.83605 │    76.8 │ 29.1613 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │  3.90501 │ 52.5714 │    21.0 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │  3.91006 │    52.4 │ 20.8065 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │  5.59024 │ 25.4571 │ 9.48387 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │  3.88581 │ 52.5714 │    21.0 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │  2.96267 │    28.0 │ 12.4516 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │  2.03389 │ 25.4571 │ 9.48387 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴─────────┴─────────┘

Static ocean benchmarks for all topologies

Raw numbers

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │   2.402 s │   2.412 s │   2.413 s │   2.424 s │ 405.84 KiB │   2460 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │   2.247 s │   2.250 s │   2.252 s │   2.259 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │   1.890 s │   1.890 s │   1.890 s │   1.890 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │   1.923 s │   1.933 s │   1.931 s │   1.936 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.864 s │   1.869 s │   1.868 s │   1.871 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │   1.685 s │   1.686 s │   1.688 s │   1.693 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.092 s │   2.114 s │   2.109 s │   2.121 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.780 s │   1.796 s │   1.801 s │   1.828 s │ 277.47 KiB │   1662 │
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │ 14.888 ms │ 22.339 ms │ 21.623 ms │ 22.605 ms │ 913.08 KiB │   9570 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │ 13.213 ms │ 18.322 ms │ 17.794 ms │ 18.914 ms │ 927.17 KiB │   8415 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │ 12.658 ms │ 18.326 ms │ 17.780 ms │ 18.495 ms │ 930.73 KiB │   8289 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │ 15.206 ms │ 22.759 ms │ 21.995 ms │ 22.799 ms │ 935.36 KiB │   6976 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 12.717 ms │ 18.315 ms │ 17.841 ms │ 18.997 ms │ 930.92 KiB │   8295 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │ 12.404 ms │ 15.545 ms │ 15.266 ms │ 15.972 ms │ 938.14 KiB │   7068 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 10.097 ms │ 13.083 ms │ 12.793 ms │ 13.159 ms │ 939.77 KiB │   6898 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  8.948 ms │ 10.050 ms │  9.948 ms │ 10.128 ms │ 945.39 KiB │   5625 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

CPU to GPU speedup

                    Topologies CPU -> GPU speedup
┌─────┬────────────────────────────────┬─────────┬─────────┬─────────┐
│  Ns │                     Topologies │ speedup │  memory │  allocs │
├─────┼────────────────────────────────┼─────────┼─────────┼─────────┤
│ 192 │    (Bounded, Bounded, Bounded) │ 107.967 │ 2.24983 │ 3.89024 │
│ 192 │   (Bounded, Bounded, Periodic) │ 122.789 │ 2.55222 │ 3.89223 │
│ 192 │   (Bounded, Periodic, Bounded) │ 103.133 │ 2.56202 │ 3.83395 │
│ 192 │  (Bounded, Periodic, Periodic) │  84.934 │ 2.95066 │ 3.86268 │
│ 192 │   (Periodic, Bounded, Bounded) │  102.06 │ 2.56254 │ 3.83673 │
│ 192 │  (Periodic, Bounded, Periodic) │ 108.457 │ 2.95943 │ 3.91362 │
│ 192 │  (Periodic, Periodic, Bounded) │ 161.616 │ 2.96456 │ 3.81949 │
│ 192 │ (Periodic, Periodic, Periodic) │ 178.682 │  3.4072 │ 3.38448 │
└─────┴────────────────────────────────┴─────────┴─────────┴─────────┘

CPU slowdown (vs. triply-periodic)

                         Topologies relative performance (CPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼─────────┼─────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │  1.34309 │ 1.46266 │ 1.48014 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │  1.25281 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │  1.05249 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │  1.07645 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.0409 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │ 0.938853 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │  1.17749 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴─────────┴─────────┘

GPU slowdown (vs. triply-periodic)

                         Topologies relative performance (GPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬──────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │   memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼──────────┼─────────┤
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │  2.22277 │ 0.965821 │ 1.70133 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │  1.82308 │ 0.980729 │   1.496 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │  1.82349 │ 0.984497 │  1.4736 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │  2.26462 │ 0.989389 │ 1.24018 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │  1.82237 │ 0.984695 │ 1.47467 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │  1.54676 │ 0.992331 │ 1.25653 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │  1.30183 │  0.99405 │ 1.22631 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │      1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴──────────┴─────────┘

Performance vs. main branch

Main branch

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.922 s │   1.922 s │   1.967 s │   2.058 s │ 363.61 KiB │   2163 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.143 s │   2.144 s │   2.145 s │   2.146 s │ 317.33 KiB │   1807 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.791 s │   1.793 s │   1.793 s │   1.794 s │ 277.77 KiB │   1661 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 32.188 ms │ 37.447 ms │ 36.936 ms │ 37.557 ms │ 985.94 KiB │  13476 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 11.051 ms │ 11.114 ms │ 11.148 ms │ 11.533 ms │ 807.44 KiB │  10746 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  9.859 ms │ 10.104 ms │ 10.136 ms │ 10.682 ms │ 707.81 KiB │   9469 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

This PR/branch

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.864 s │   1.869 s │   1.868 s │   1.871 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.092 s │   2.114 s │   2.109 s │   2.121 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.780 s │   1.796 s │   1.801 s │   1.828 s │ 277.47 KiB │   1662 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 12.717 ms │ 18.315 ms │ 17.841 ms │ 18.997 ms │ 930.92 KiB │   8295 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 10.097 ms │ 13.083 ms │ 12.793 ms │ 13.159 ms │ 939.77 KiB │   6898 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  8.948 ms │ 10.050 ms │  9.948 ms │ 10.128 ms │ 945.39 KiB │   5625 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

@ali-ramadhan
Member Author

ali-ramadhan commented Feb 4, 2021

Some highlights/conclusions:

  1. (Periodic, Bounded, Bounded) channels used to take ~37 ms/time step on GPUs but now take ~18 ms/time step, so they're ~2x as fast 🎉
  2. Our favorite (Periodic, Periodic, Bounded) topology slowed down a bit: from ~11 ms to ~13 ms/time step. This might be due to extra kernel launches in the discrete transforms (index permutations are now done inside the discrete transforms).
  3. The FFTBasedPoissonSolver allocates quite a bit of memory when DCTs are involved. We should probably check whether there are any obvious sources of allocations that can be eliminated, but that's probably not strictly required.

@ali-ramadhan ali-ramadhan marked this pull request as ready for review February 4, 2021 16:05
@navidcy
Collaborator

navidcy commented Feb 4, 2021

(Bounded, Bounded, Bounded) on GPU!??

very relevant for #1085

@glwagner
Member

glwagner commented Feb 4, 2021

(Bounded, Bounded, Bounded) on GPU!?

You got it.

dims :: Δ
topology :: Ω
normalization :: N
twiddle :: T
Member

this is a technical name

Member Author

True. I'll rename it to twiddle_factors, which is still pretty technical but has the benefit of having a Wikipedia page: https://en.wikipedia.org/wiki/Twiddle_factor

Member

Wow, I was joking but it is a technical name. Could write

twiddle_factors :: T # https://en.wikipedia.org/wiki/Twiddle_factor

periodic_dims = findall(t -> t == Periodic, topo)
bounded_dims = findall(t -> t == Bounded, topo)

if arch isa GPU && topo in non_batched_topologies
Member

😱

)
end

else
Member

Is this the case (Periodic, Periodic, Periodic)?

Member Author

This is the case where batching transforms is possible. It's always possible on the CPU since FFTW is awesome, so it includes all topologies on the CPU.

On the GPU batching is possible when the topology is not one of non_batched_topologies (where an FFT is needed along dimension 2), so it includes (Periodic, Periodic, Periodic), (Periodic, Periodic, Bounded), and (Bounded, Periodic, Periodic).
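In code the rule is essentially this one-liner (a hypothetical helper for illustration; the real check and the non_batched_topologies list live in plan_transforms.jl):

using Oceananigans: CPU, GPU   # architecture types

batched_transforms_possible(arch, topo, non_batched_topologies) =
    arch isa CPU || topo ∉ non_batched_topologies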

Member

Future generations may thank you if you put this comment in the code

@@ -0,0 +1,23 @@
function solve_poisson_equation!(solver)
Member

Maybe a docstring that says something like "we solve ∇²ϕ = RHS"?
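For example (just a sketch of the wording):

"""
    solve_poisson_equation!(solver)

Solve the discrete Poisson equation ∇²ϕ = RHS for ϕ using the transforms and
eigenvalues stored in `solver`.
"""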

# Setting DC component of the solution (the mean) to be zero. This is also
# necessary because the source term to the Poisson equation has zero mean
# and so the DC component comes out to be ∞.
CUDA.@allowscalar ϕ[1, 1, 1] = 0
Member

Isn't the problem that λx[1, 1, 1] + λy[1, 1, 1] + λz[1, 1, 1] = 0? If RHS[1, 1, 1] = 0 we get NaN, otherwise we'd get Inf (and we can't have either). I'm not 100% sure what DC refers to but I think it's sufficient to mention that, in eigenspace, ϕ[1, 1, 1] is the "zeroth mode" corresponding to the volume mean of the transform of ϕ, or of ϕ in physical space.

Member Author

Yeah exactly, so you get one NaN which then turns the entire array into NaNs once you apply an inverse transform.

Ah sorry DC component = direct current component lol, I guess referring to the fact that there's a non-zero mean.

The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant (the global mean of the solution), so we need to pick a constant. ϕ[1, 1, 1] = 0 chooses the constant to be zero so that the solution has zero-mean.

I guess we always take gradients of the pressure so the constant theoretically shouldn't matter.

I'll improve the comment and add your suggestion.
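Something along these lines, maybe (a sketch; modulo sign conventions):

# In eigenspace the Poisson equation reduces to a pointwise division of the
# transformed RHS by the eigenvalue sum (λx + λy + λz). The zeroth eigenvalues
# vanish, λx[1] = λy[1] = λz[1] = 0, so ϕ[1, 1, 1] comes out as 0/0 (NaN) or
# x/0 (Inf), either of which poisons the inverse transform. Physically,
# solutions of Poisson's equation are only determined up to a constant (the
# volume mean of ϕ), so we select the zero-mean solution by zeroing the zeroth
# mode (the "DC component"):
CUDA.@allowscalar ϕ[1, 1, 1] = 0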

Collaborator

Thanks for explaining; I thought DC was discrete cosine. But since we are talking about the mean, I guess that would be the first component of the discrete cosine transform?

Member Author

I think it's there whether we're doing FFTs for Periodic or DCTs for Bounded but yeah I think in both cases it's the first (zeroth?) component of the transform.

Member

The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant

I agree with this! We also know that adding a constant to pressure leaves the problem unchanged (at least for the equation of state we use...)

"""
@kernel function copy_pressure!(p, ϕ, solver_type, arch, grid)
@kernel function copy_pressure!(p, ϕ, arch, grid::AbstractGrid{FT, TX, TY, TZ}) where {FT, TX, TY, TZ}
Member

Why do we need FT, TX, TY, TZ? Maybe I'm missing something.

Member Author

Yeah not sure why that's there. Will remove.
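i.e. it can just be (body shown only as a sketch):

using KernelAbstractions

@kernel function copy_pressure!(p, ϕ, arch, grid)
    i, j, k = @index(Global, NTuple)
    @inbounds p[i, j, k] = real(ϕ[i, j, k])
end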

@glwagner glwagner left a comment (Member)

Very nice work @ali-ramadhan!

@navidcy
Collaborator

navidcy commented Feb 4, 2021

Definitely new release after this.

@ali-ramadhan
Member Author

(image)
feels pretty good.
