
Faster, cleaner pressure solvers for all topologies on CPUs and GPUs #1338

Merged · 40 commits · Feb 8, 2021

Conversation

ali-ramadhan
Member

@ali-ramadhan ali-ramadhan commented Feb 3, 2021

This PR is still a work in progress, but I'm opening it to make the future design of the pressure solver module more transparent, as we will be adding some new pressure solvers soon, including a conjugate-gradient solver by @christophernhill.

Motivation

In PR #290 I implemented a pressure solver for the (Periodic, Bounded, Bounded) channel topology using the 2D fast cosine transform algorithm described by Makhoul (1982), as CUFFT does not provide cosine transforms for the GPU and does not support FFTs along non-batched dimensions (see JuliaGPU/CUDA.jl#119).

This has been an unpopular decision, for good reasons: the 2D DCT algorithm is quite slow (channels are ~2x slower than doubly-periodic models on GPUs) and quite complicated. Due to my inexperience, I didn't realize at the time that transposing the array to do the FFT was the way forward.
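For context, the 1D version of the idea is a reordering of the input, a single FFT, and a twiddle-factor correction; here is a minimal sketch (my own illustration, even N assumed, unnormalized DCT-II), while the 2D algorithm the old solver used is more involved:

using FFTW

function dct_via_fft(x)
    N = length(x)
    v = similar(x)
    v[1:N÷2]      .= x[1:2:N]   # even-indexed samples (0-based), in order
    v[N:-1:N÷2+1] .= x[2:2:N]   # odd-indexed samples, reversed
    V = fft(v)
    k = 0:N-1
    return 2 .* real.(exp.(-im * π .* k ./ (2N)) .* V)   # twiddle-factor correction
end

x = rand(8)
dct_via_fft(x) ≈ FFTW.r2r(x, FFTW.REDFT10)   # matches the unnormalized DCT-II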

The pressure solver module is also quite out of date: it hasn't been updated since topologies were introduced (#614) almost exactly a year ago.

This PR refactors the pressure solver module to:

  1. Support all topologies on the CPU and GPU, performing transposes and index permutations as needed by each transform.
  2. Use the fastest transforms allowed by the topology. This means batching dimensions when possible.
  3. Consolidate all pressure solvers into a single solver for all topologies. This should simplify the code and make it easier to extend.

Resolves #586
Resolves #593
Resolves #594
Resolves #1007

To batch or not to batch for FFTW on CPUs?

TODO:

  • Benchmark 1D {FFT, IFFT}{x, y, z}.
  • Benchmark 3D {FFT, IFFT}
  • Benchmark 1D {DCT, IDCT}{x, y, z}.
  • Benchmark 3D {DCT, IDCT}
  • Try N = 16, 64, 256
  • Is it faster to do 3 1D transforms or 1 3D transform? Answer: 1 3D transform.

To see whether we should just do 1D transforms for everything or whether batching is faster, I ran some 1D and 3D FFT benchmarks. The results for the triply-periodic topology are posted below.

Based on the benchmarks, it seems that for 256^3 doing three 1D transforms is ~15% slower than doing one 3D transform. So it makes sense to batch transforms when possible.

Note that the FFT along dimension 1 is the fastest and the FFT along dimension 2 is the slowest, with dimension 3 in between. So whatever FFTW is doing under the hood, FFTs along non-batched dimensions (e.g. along dimension 2) are slow.
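For reference, the comparison boils down to something like the sketch below (illustrative only, not the actual benchmark script):

using FFTW, BenchmarkTools

N = 256
A = rand(ComplexF64, N, N, N)

# One in-place plan covering all three dimensions vs. three separate 1D plans.
p3d = plan_fft!(A)
p1d = [plan_fft!(A, d) for d in 1:3]

@btime $p3d * $A                                     # 1 × 3D FFT (in place)
@btime $(p1d[3]) * ($(p1d[2]) * ($(p1d[1]) * $A))    # 3 × 1D FFTs (in place)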

                                               FFT benchmarks
┌───────────────┬─────┬───────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │      dims │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼───────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           CPU │  16 │ (1, 2, 3) │  13.948 μs │  14.043 μs │  20.717 μs │  80.605 μs │   0 bytes │      0 │
│           CPU │  64 │ (1, 2, 3) │   1.656 ms │   1.717 ms │   1.809 ms │   2.697 ms │   0 bytes │      0 │
│           CPU │ 256 │ (1, 2, 3) │ 229.619 ms │ 233.008 ms │ 234.033 ms │ 243.288 ms │   0 bytes │      0 │
│           CPU │  16 │         1 │   3.240 μs │   3.255 μs │   3.603 μs │   6.746 μs │   0 bytes │      0 │
│           CPU │  64 │         1 │ 445.803 μs │ 458.928 μs │ 513.041 μs │ 755.937 μs │   0 bytes │      0 │
│           CPU │ 256 │         1 │  61.083 ms │  63.464 ms │  63.969 ms │  67.009 ms │   0 bytes │      0 │
│           CPU │  16 │         2 │   4.085 μs │   4.135 μs │   4.723 μs │   8.088 μs │   0 bytes │      0 │
│           CPU │  64 │         2 │ 564.769 μs │ 579.278 μs │ 615.731 μs │ 804.859 μs │   0 bytes │      0 │
│           CPU │ 256 │         2 │ 110.718 ms │ 111.560 ms │ 111.506 ms │ 112.525 ms │   0 bytes │      0 │
│           CPU │  16 │         3 │   7.772 μs │   7.787 μs │   9.499 μs │  24.886 μs │   0 bytes │      0 │
│           CPU │  64 │         3 │ 684.541 μs │ 688.275 μs │ 811.874 μs │   1.463 ms │   0 bytes │      0 │
│           CPU │ 256 │         3 │  93.902 ms │  94.489 ms │  94.604 ms │  95.639 ms │   0 bytes │      0 │
└───────────────┴─────┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

3D FFT --> 3 × 1D FFTs slowdown:
CPU,  16: 1.0807x
CPU,  64: 1.0053x
CPU, 256: 1.1567x

To batch or not to batch for CUFFT on GPUs?

We should investigate this separately for CUFFT since FFT along dimension 2 requires a transpose.

TODO:

  • Figure out how to do an FFT_y on the GPU!
  • Implement and benchmark doing it the distributed way.
  • Benchmark 1 3D FFT with 3 1D FFTs.
  • Benchmark 1 3D DCT with 3 1D DCTs.

The same benchmarks for the GPU are posted below. Batching is much faster (by a factor of 2-3), so we should batch when possible.

Note that FFTs along non-batched dimensions (dimension 2 in this case) are much slower since they involve two transpose operations.
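Conceptually, a transform along dimension 2 looks something like this sketch (illustrative only, not the actual implementation):

using CUDA
using CUDA.CUFFT

A = CuArray(rand(ComplexF32, 64, 64, 64))

At = permutedims(A, (2, 1, 3))   # transpose so dimension 2 becomes dimension 1
fft!(At, 1)                      # now a fast, batched FFT along the first dimension
A .= permutedims(At, (2, 1, 3))  # transpose back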

Batching will not be possible for some topologies, in which case we'll take a performance hit. But if the pressure solver still only accounts for ~10-15% of the runtime, then a 2x hit on the pressure solver is not that large. The hit will mostly affect topologies we don't currently support anyway.

                                               FFT benchmarks
┌───────────────┬─────┬───────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │      dims │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼───────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           GPU │  16 │ (1, 2, 3) │  25.478 μs │  32.459 μs │ 122.062 μs │ 703.376 μs │ 224 bytes │     13 │
│           GPU │  64 │ (1, 2, 3) │  67.226 μs │  71.497 μs │ 146.042 μs │ 647.734 μs │ 224 bytes │     13 │
│           GPU │ 256 │ (1, 2, 3) │   2.982 ms │   3.041 ms │   3.036 ms │   3.116 ms │ 224 bytes │     13 │
│           GPU │  16 │         1 │  14.755 μs │  30.020 μs │ 107.932 μs │ 677.045 μs │  96 bytes │      5 │
│           GPU │  64 │         1 │  26.521 μs │  41.294 μs │ 114.587 μs │ 674.834 μs │  96 bytes │      5 │
│           GPU │ 256 │         1 │ 930.371 μs │ 936.222 μs │ 954.771 μs │   1.060 ms │  96 bytes │      5 │
│           GPU │  16 │         2 │  26.547 μs │  49.440 μs │ 127.426 μs │ 768.771 μs │  1.41 KiB │     59 │
│           GPU │  64 │         2 │ 116.160 μs │ 117.772 μs │ 193.909 μs │ 797.293 μs │  1.41 KiB │     59 │
│           GPU │ 256 │         2 │   4.963 ms │   5.010 ms │   5.014 ms │   5.073 ms │  1.41 KiB │     59 │
│           GPU │  16 │         3 │  14.918 μs │  22.509 μs │  40.029 μs │ 110.119 μs │ 224 bytes │     13 │
│           GPU │  64 │         3 │  40.151 μs │  45.495 μs │ 124.422 μs │ 646.093 μs │ 224 bytes │     13 │
│           GPU │ 256 │         3 │   1.062 ms │   1.067 ms │   1.101 ms │   1.292 ms │ 224 bytes │     13 │
└───────────────┴─────┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

3D FFT --> 3 × 1D FFTs slowdown:
GPU,  16: 3.1414x
GPU,  64: 2.8611x
GPU, 256: 2.3062x

@ali-ramadhan ali-ramadhan marked this pull request as draft February 3, 2021 18:38
@glwagner
Member

glwagner commented Feb 3, 2021

Based on the benchmarks, it seems that for 256^3 doing three 1D transforms is ~15% slower than doing one 3D transform. So it makes sense to batch transforms when possible.

I think this makes sense given my primitive understanding of how FFTW picks optimal plans for the particular problem it's asked to solve.
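For what it's worth, you can see what FFTW decides for a given problem by displaying the plan (illustrative only; FFTW.MEASURE makes it actually time candidate algorithms):

using FFTW

A = rand(ComplexF64, 256, 256, 256)

p3d = plan_fft(A; flags=FFTW.MEASURE)      # one plan over dims (1, 2, 3)
p_y = plan_fft(A, 2; flags=FFTW.MEASURE)   # plan along dimension 2 only

p3d   # in the REPL, displaying the plan prints the algorithm FFTW chose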

@ali-ramadhan
Member Author

Tests all pass, but the code is a bit messy (especially plan_transforms.jl) due to lots of special cases that I'm not yet sure how to simplify, so it needs some work.

I ran the new FFT-based Poisson solver benchmarks on Tartarus (Titan V GPUs) and static ocean benchmarks for all topologies on Satori (Tesla V100 GPUs), plus a performance comparison against the main branch. Results are below; I'll post a follow-up with some highlights/conclusions.

FFT-based Poisson solver benchmarks

Raw numbers

                                               FFT-based Poisson solver benchmarks
┌───────────────┬─────┬────────────────────────────────┬────────────┬────────────┬────────────┬────────────┬───────────┬────────┐
│ Architectures │  Ns │                     Topologies │        min │     median │       mean │        max │    memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼────────────┼────────────┼────────────┼────────────┼───────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │ 560.466 ms │ 563.145 ms │ 563.074 ms │ 566.700 ms │ 192 bytes │      4 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │ 434.408 ms │ 435.974 ms │ 437.003 ms │ 441.246 ms │ 160 bytes │      2 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │ 472.312 ms │ 473.340 ms │ 473.620 ms │ 475.649 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │ 333.460 ms │ 334.702 ms │ 334.998 ms │ 336.918 ms │ 160 bytes │      2 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │ 495.012 ms │ 497.853 ms │ 497.462 ms │ 500.181 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │ 363.169 ms │ 365.104 ms │ 365.891 ms │ 373.893 ms │ 160 bytes │      2 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │ 349.305 ms │ 350.431 ms │ 352.641 ms │ 371.861 ms │ 160 bytes │      2 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │ 203.109 ms │ 203.653 ms │ 204.025 ms │ 206.834 ms │ 192 bytes │      4 │
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │   7.765 ms │  16.841 ms │  15.934 ms │  16.872 ms │ 84.00 KiB │    904 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │   6.492 ms │  13.599 ms │  12.878 ms │  13.633 ms │ 57.50 KiB │    651 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │   6.432 ms │  13.616 ms │  12.883 ms │  13.640 ms │ 57.31 KiB │    645 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │   9.430 ms │  19.467 ms │  18.452 ms │  19.582 ms │ 27.84 KiB │    294 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │   6.330 ms │  13.532 ms │  12.824 ms │  13.642 ms │ 57.50 KiB │    651 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │   4.882 ms │  10.317 ms │   9.772 ms │  10.332 ms │ 30.63 KiB │    386 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │   3.424 ms │   7.083 ms │   6.713 ms │   7.175 ms │ 27.84 KiB │    294 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │   1.843 ms │   3.482 ms │   3.318 ms │   3.489 ms │  1.09 KiB │     31 │
└───────────────┴─────┴────────────────────────────────┴────────────┴────────────┴────────────┴────────────┴───────────┴────────┘

CPU to GPU speedup

             FFT-based Poisson solver CPU -> GPU speedup
┌─────┬────────────────────────────────┬─────────┬─────────┬────────┐
│  Ns │                     Topologies │ speedup │  memory │ allocs │
├─────┼────────────────────────────────┼─────────┼─────────┼────────┤
│ 192 │    (Bounded, Bounded, Bounded) │ 33.4393 │   448.0 │  226.0 │
│ 192 │   (Bounded, Bounded, Periodic) │ 32.0602 │   368.0 │  325.5 │
│ 192 │   (Bounded, Periodic, Bounded) │ 34.7631 │   366.8 │  322.5 │
│ 192 │  (Bounded, Periodic, Periodic) │ 17.1932 │   178.2 │  147.0 │
│ 192 │   (Periodic, Bounded, Bounded) │ 36.7915 │   368.0 │  325.5 │
│ 192 │  (Periodic, Bounded, Periodic) │ 35.3884 │   196.0 │  193.0 │
│ 192 │  (Periodic, Periodic, Bounded) │ 49.4769 │   178.2 │  147.0 │
│ 192 │ (Periodic, Periodic, Periodic) │ 58.4816 │ 5.83333 │   7.75 │
└─────┴────────────────────────────────┴─────────┴─────────┴────────┘

CPU slowdown (vs. triply-periodic)

                  FFT-based Poisson solver relative performance (CPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬──────────┬────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │   memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼──────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │  2.76522 │      1.0 │    1.0 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │  2.14077 │ 0.833333 │    0.5 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │  2.32425 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │  1.64349 │ 0.833333 │    0.5 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │  2.44462 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │  1.79278 │ 0.833333 │    0.5 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │  1.72073 │ 0.833333 │    0.5 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │      1.0 │    1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴──────────┴────────┘

GPU slowdown (vs. triply-periodic)

                  FFT-based Poisson solver relative performance (GPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼─────────┼─────────┤
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │  4.83605 │    76.8 │ 29.1613 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │  3.90501 │ 52.5714 │    21.0 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │  3.91006 │    52.4 │ 20.8065 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │  5.59024 │ 25.4571 │ 9.48387 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │  3.88581 │ 52.5714 │    21.0 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │  2.96267 │    28.0 │ 12.4516 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │  2.03389 │ 25.4571 │ 9.48387 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴─────────┴─────────┘

Static ocean benchmarks for all topologies

Raw numbers

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │   2.402 s │   2.412 s │   2.413 s │   2.424 s │ 405.84 KiB │   2460 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │   2.247 s │   2.250 s │   2.252 s │   2.259 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │   1.890 s │   1.890 s │   1.890 s │   1.890 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │   1.923 s │   1.933 s │   1.931 s │   1.936 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.864 s │   1.869 s │   1.868 s │   1.871 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │   1.685 s │   1.686 s │   1.688 s │   1.693 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.092 s │   2.114 s │   2.109 s │   2.121 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.780 s │   1.796 s │   1.801 s │   1.828 s │ 277.47 KiB │   1662 │
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │ 14.888 ms │ 22.339 ms │ 21.623 ms │ 22.605 ms │ 913.08 KiB │   9570 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │ 13.213 ms │ 18.322 ms │ 17.794 ms │ 18.914 ms │ 927.17 KiB │   8415 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │ 12.658 ms │ 18.326 ms │ 17.780 ms │ 18.495 ms │ 930.73 KiB │   8289 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │ 15.206 ms │ 22.759 ms │ 21.995 ms │ 22.799 ms │ 935.36 KiB │   6976 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 12.717 ms │ 18.315 ms │ 17.841 ms │ 18.997 ms │ 930.92 KiB │   8295 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │ 12.404 ms │ 15.545 ms │ 15.266 ms │ 15.972 ms │ 938.14 KiB │   7068 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 10.097 ms │ 13.083 ms │ 12.793 ms │ 13.159 ms │ 939.77 KiB │   6898 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  8.948 ms │ 10.050 ms │  9.948 ms │ 10.128 ms │ 945.39 KiB │   5625 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

CPU to GPU speedup

                    Topologies CPU -> GPU speedup
┌─────┬────────────────────────────────┬─────────┬─────────┬─────────┐
│  Ns │                     Topologies │ speedup │  memory │  allocs │
├─────┼────────────────────────────────┼─────────┼─────────┼─────────┤
│ 192 │    (Bounded, Bounded, Bounded) │ 107.967 │ 2.24983 │ 3.89024 │
│ 192 │   (Bounded, Bounded, Periodic) │ 122.789 │ 2.55222 │ 3.89223 │
│ 192 │   (Bounded, Periodic, Bounded) │ 103.133 │ 2.56202 │ 3.83395 │
│ 192 │  (Bounded, Periodic, Periodic) │  84.934 │ 2.95066 │ 3.86268 │
│ 192 │   (Periodic, Bounded, Bounded) │  102.06 │ 2.56254 │ 3.83673 │
│ 192 │  (Periodic, Bounded, Periodic) │ 108.457 │ 2.95943 │ 3.91362 │
│ 192 │  (Periodic, Periodic, Bounded) │ 161.616 │ 2.96456 │ 3.81949 │
│ 192 │ (Periodic, Periodic, Periodic) │ 178.682 │  3.4072 │ 3.38448 │
└─────┴────────────────────────────────┴─────────┴─────────┴─────────┘

CPU slowdown (vs. triply-periodic)

                         Topologies relative performance (CPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬─────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │  memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼─────────┼─────────┤
│           CPU │ 192 │    (Bounded, Bounded, Bounded) │  1.34309 │ 1.46266 │ 1.48014 │
│           CPU │ 192 │   (Bounded, Bounded, Periodic) │  1.25281 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │   (Bounded, Periodic, Bounded) │  1.05249 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │  (Bounded, Periodic, Periodic) │  1.07645 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.0409 │ 1.30927 │ 1.30084 │
│           CPU │ 192 │  (Periodic, Bounded, Periodic) │ 0.938853 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │  1.17749 │ 1.14247 │ 1.08664 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │     1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴─────────┴─────────┘

GPU slowdown (vs. triply-periodic)

                         Topologies relative performance (GPU)
┌───────────────┬─────┬────────────────────────────────┬──────────┬──────────┬─────────┐
│ Architectures │  Ns │                     Topologies │ slowdown │   memory │  allocs │
├───────────────┼─────┼────────────────────────────────┼──────────┼──────────┼─────────┤
│           GPU │ 192 │    (Bounded, Bounded, Bounded) │  2.22277 │ 0.965821 │ 1.70133 │
│           GPU │ 192 │   (Bounded, Bounded, Periodic) │  1.82308 │ 0.980729 │   1.496 │
│           GPU │ 192 │   (Bounded, Periodic, Bounded) │  1.82349 │ 0.984497 │  1.4736 │
│           GPU │ 192 │  (Bounded, Periodic, Periodic) │  2.26462 │ 0.989389 │ 1.24018 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │  1.82237 │ 0.984695 │ 1.47467 │
│           GPU │ 192 │  (Periodic, Bounded, Periodic) │  1.54676 │ 0.992331 │ 1.25653 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │  1.30183 │  0.99405 │ 1.22631 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │      1.0 │      1.0 │     1.0 │
└───────────────┴─────┴────────────────────────────────┴──────────┴──────────┴─────────┘

Performance vs. main branch

Main branch

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.922 s │   1.922 s │   1.967 s │   2.058 s │ 363.61 KiB │   2163 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.143 s │   2.144 s │   2.145 s │   2.146 s │ 317.33 KiB │   1807 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.791 s │   1.793 s │   1.793 s │   1.794 s │ 277.77 KiB │   1661 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 32.188 ms │ 37.447 ms │ 36.936 ms │ 37.557 ms │ 985.94 KiB │  13476 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 11.051 ms │ 11.114 ms │ 11.148 ms │ 11.533 ms │ 807.44 KiB │  10746 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  9.859 ms │ 10.104 ms │ 10.136 ms │ 10.682 ms │ 707.81 KiB │   9469 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

This PR/branch

                                                    Topologies benchmarks
┌───────────────┬─────┬────────────────────────────────┬───────────┬───────────┬───────────┬───────────┬────────────┬────────┐
│ Architectures │  Ns │                     Topologies │       min │    median │      mean │       max │     memory │ allocs │
├───────────────┼─────┼────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼────────────┼────────┤
│           CPU │ 192 │   (Periodic, Bounded, Bounded) │   1.864 s │   1.869 s │   1.868 s │   1.871 s │ 363.28 KiB │   2162 │
│           CPU │ 192 │  (Periodic, Periodic, Bounded) │   2.092 s │   2.114 s │   2.109 s │   2.121 s │ 317.00 KiB │   1806 │
│           CPU │ 192 │ (Periodic, Periodic, Periodic) │   1.780 s │   1.796 s │   1.801 s │   1.828 s │ 277.47 KiB │   1662 │
│           GPU │ 192 │   (Periodic, Bounded, Bounded) │ 12.717 ms │ 18.315 ms │ 17.841 ms │ 18.997 ms │ 930.92 KiB │   8295 │
│           GPU │ 192 │  (Periodic, Periodic, Bounded) │ 10.097 ms │ 13.083 ms │ 12.793 ms │ 13.159 ms │ 939.77 KiB │   6898 │
│           GPU │ 192 │ (Periodic, Periodic, Periodic) │  8.948 ms │ 10.050 ms │  9.948 ms │ 10.128 ms │ 945.39 KiB │   5625 │
└───────────────┴─────┴────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴────────────┴────────┘

@ali-ramadhan
Member Author

ali-ramadhan commented Feb 4, 2021

Some highlights/conclusions:

  1. (Periodic, Bounded, Bounded) channels used to take ~37 ms/time step on GPUs but now take ~18 ms/time step, so they're ~2x as fast 🎉
  2. Our favorite (Periodic, Periodic, Bounded) topology slowed down a bit: from ~11 ms to ~13 ms/time step. This might be due to extra kernel launches in the discrete transforms (index permutations are now done inside the discrete transforms).
  3. The FFTBasedPoissonSolver allocates quite a bit of memory when DCTs are involved. We should probably check whether there are any obvious sources of allocations that can be eliminated, but that's probably not strictly required.

@ali-ramadhan ali-ramadhan marked this pull request as ready for review February 4, 2021 16:05
@navidcy
Collaborator

navidcy commented Feb 4, 2021

(Bounded, Bounded, Bounded) on GPU!??

very relevant for #1085

@glwagner
Member

glwagner commented Feb 4, 2021

(Bounded, Bounded, Bounded) on GPU!?

You got it.

dims :: Δ
topology :: Ω
normalization :: N
twiddle :: T
Member

this is a technical name

Member Author

True. I'll rename it to twiddle_factors, which is still pretty technical but has the benefit of having a Wikipedia page: https://en.wikipedia.org/wiki/Twiddle_factor

Member

Wow, I was joking but it is a technical name. Could write

twiddle_factors :: T # https://en.wikipedia.org/wiki/Twiddle_factor

periodic_dims = findall(t -> t == Periodic, topo)
bounded_dims = findall(t -> t == Bounded, topo)

if arch isa GPU && topo in non_batched_topologies
Member

😱

)
end

else
Member

Is this the case (Periodic, Periodic, Periodic)?

Member Author

This is the case where batching transforms is possible. It's always possible on the CPU since FFTW is awesome, so it includes all topologies on the CPU.

On the GPU batching is possible when the topology is not one of non_batched_topologies (where an FFT is needed along dimension 2), so it includes (Periodic, Periodic, Periodic), (Periodic, Periodic, Bounded), and (Bounded, Periodic, Periodic).
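In code the rule is essentially this one-liner (a hypothetical helper for illustration; the real check and the non_batched_topologies list live in plan_transforms.jl):

using Oceananigans: CPU, GPU   # architecture types

batched_transforms_possible(arch, topo, non_batched_topologies) =
    arch isa CPU || topo ∉ non_batched_topologies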

Member

Future generations may thank you if you put this comment in the code

@@ -0,0 +1,23 @@
function solve_poisson_equation!(solver)
Member

Maybe a docstring that says something like "we solve ∇²ϕ = RHS"?
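For example (just a sketch of the wording):

"""
    solve_poisson_equation!(solver)

Solve the discrete Poisson equation ∇²ϕ = RHS for ϕ using the transforms and
eigenvalues stored in `solver`.
"""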

# Setting DC component of the solution (the mean) to be zero. This is also
# necessary because the source term to the Poisson equation has zero mean
# and so the DC component comes out to be ∞.
CUDA.@allowscalar ϕ[1, 1, 1] = 0
Member

Isn't the problem that λx[1, 1, 1] + λy[1, 1, 1] + λz[1, 1, 1] = 0? If RHS[1, 1, 1] = 0 we get NaN, otherwise we'd get Inf (and we can't have either). I'm not 100% sure what DC refers to but I think it's sufficient to mention that, in eigenspace, ϕ[1, 1, 1] is the "zeroth mode" corresponding to the volume mean of the transform of ϕ, or of ϕ in physical space.

Member Author

Yeah exactly, so you get one NaN which then turns the entire array into NaNs once you apply an inverse transform.

Ah sorry DC component = direct current component lol, I guess referring to the fact that there's a non-zero mean.

The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant (the global mean of the solution), so we need to pick a constant. ϕ[1, 1, 1] = 0 chooses the constant to be zero so that the solution has zero-mean.

I guess we always take gradients of the pressure so the constant theoretically shouldn't matter.

I'll improve the comment and add your suggestion.
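Something along these lines, maybe (a sketch; modulo sign conventions):

# In eigenspace the Poisson equation reduces to a pointwise division of the
# transformed RHS by the eigenvalue sum (λx + λy + λz). The zeroth eigenvalues
# vanish, λx[1] = λy[1] = λz[1] = 0, so ϕ[1, 1, 1] comes out as 0/0 (NaN) or
# x/0 (Inf), either of which poisons the inverse transform. Physically,
# solutions of Poisson's equation are only determined up to a constant (the
# volume mean of ϕ), so we select the zero-mean solution by zeroing the zeroth
# mode (the "DC component"):
CUDA.@allowscalar ϕ[1, 1, 1] = 0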

Collaborator

Thanks for explaining; I thought DC was discrete cosine. But since we are talking about the mean, I guess that would be the first component of the discrete cosine transform?

Member Author

I think it's there whether we're doing FFTs for Periodic or DCTs for Bounded but yeah I think in both cases it's the first (zeroth?) component of the transform.

Member

The physical reasoning I've had for why this step is needed is that solutions to Poisson's equation are only unique up to a constant

I agree with this! We also know that adding a constant to pressure leaves the problem unchanged (at least for the equation of state we use...)

"""
@kernel function copy_pressure!(p, ϕ, solver_type, arch, grid)
@kernel function copy_pressure!(p, ϕ, arch, grid::AbstractGrid{FT, TX, TY, TZ}) where {FT, TX, TY, TZ}
Member

Why do we need FT, TX, TY, TZ? Maybe I'm missing something.

Member Author

Yeah not sure why that's there. Will remove.
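i.e. it can just be (body shown only as a sketch):

using KernelAbstractions

@kernel function copy_pressure!(p, ϕ, arch, grid)
    i, j, k = @index(Global, NTuple)
    @inbounds p[i, j, k] = real(ϕ[i, j, k])
end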

@glwagner glwagner left a comment (Member)

Very nice work @ali-ramadhan!

@navidcy
Collaborator

navidcy commented Feb 4, 2021

Definitely new release after this.

@ali-ramadhan
Member Author

(image)
feels pretty good.
