Update perf, add ref.
luraess committed Jul 29, 2024
1 parent 930ba6c commit 003598b
Showing 2 changed files with 38 additions and 24 deletions.
13 changes: 6 additions & 7 deletions paper/paper.tex
@@ -30,8 +30,8 @@ \section{Introduction}
@parallel memopt=true optvars=T function step!(
    T2, T, Ci, lam, dt, _dx, _dy, _dz)
    @inn(T2) = @inn(T) + dt*(
        lam*@inn(Ci)*(@d2_xi(T)*_dx^2 +
                      @d2_yi(T)*_dy^2 +
                      @d2_zi(T)*_dz^2 ) )
    return
end
@@ -74,9 +74,9 @@ \section{Introduction}
\end{figure}

\section{Approach}
Our approach to expressing architecture-agnostic high-performance stencil computations relies on Julia's powerful metaprogramming capabilities, zero-cost high-level abstractions and multiple dispatch. We have instantiated the approach in the Julia package \texttt{ParallelStencil.jl}. Using ParallelStencil, a simple call to the macro \texttt{@parallel} is sufficient to parallelize and launch a kernel that contains stencil computations, which can be expressed explicitly or with math-close notation. The latter is defined in isolated submodules (e.g., line 2) that are easily understandable and extensible by domain scientists in order to support new numerical methods (math-close notation is currently available for finite differences). Fig.~\ref{fig:code} shows a stencil-based 3-D heat diffusion xPU solver implemented using ParallelStencil, where the kernel defining an explicit time step is written in math-close notation (lines 5-12) and the macro \texttt{@parallel} is used for its parallelization (line 5) and launch (line 36).
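
For illustration, the same update can also be written in ParallelStencil's explicit, index-based notation. The following sketch is not taken from Fig.~\ref{fig:code}; the kernel name \texttt{step\_explicit!} and the interior-point bounds check are illustrative:

@parallel_indices (ix, iy, iz) function step_explicit!(
    T2, T, Ci, lam, dt, _dx, _dy, _dz)
    # update interior points with second-order central differences
    if (ix>1 && ix<size(T,1) && iy>1 && iy<size(T,2) && iz>1 && iz<size(T,3))
        T2[ix,iy,iz] = T[ix,iy,iz] + dt*lam*Ci[ix,iy,iz]*(
            (T[ix-1,iy,iz] - 2.0*T[ix,iy,iz] + T[ix+1,iy,iz])*_dx^2 +
            (T[ix,iy-1,iz] - 2.0*T[ix,iy,iz] + T[ix,iy+1,iz])*_dy^2 +
            (T[ix,iy,iz-1] - 2.0*T[ix,iy,iz] + T[ix,iy,iz+1])*_dz^2 )
    end
    return
end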

The backend package used for parallelization is defined in an initialization call beforehand (Fig.~\ref{fig:code}, line 3). Currently supported are \texttt{CUDA.jl} \cite{besard2018effective} for running on GPUs and \texttt{Base.Threads} for CPUs. Leveraging metaprogramming, ParallelStencil automatically generates high-performance code suitable for the target hardware and automatically derives kernel launch parameters from the kernel arguments by analyzing the bounds of the contained arrays. Certain stencil-computation-specific optimizations leveraging, e.g., the on-chip memory of GPUs, need to be activated with keyword arguments to the macro \texttt{@parallel} (Fig.~\ref{fig:code}, line 5). A set of architecture-agnostic low-level kernel language constructs allows for explicit low-level kernel programming when useful, e.g., for the explicit control of shared memory on the GPU (these low-level constructs are biased towards GPU computing).
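
As a minimal sketch of this workflow (not the exact code of Fig.~\ref{fig:code}; array sizes, material parameters and grid spacing are illustrative, and the \texttt{memopt} variant may require matching keyword arguments at launch), a \texttt{step!} kernel like the one above is initialized and launched as follows:

using ParallelStencil
using ParallelStencil.FiniteDifferences3D
@init_parallel_stencil(Threads, Float64, 3)  # GPU: @init_parallel_stencil(CUDA, Float64, 3) after `using CUDA`

# (definition of the step! kernel as in the listing above goes here)

function run_diffusion(nx=128, ny=128, nz=128, nt=100)
    T   = @rand(nx, ny, nz)      # temperature
    T2  = copy(T)                # buffer for the updated temperature
    Ci  = @ones(nx, ny, nz)      # inverse heat capacity
    lam = 1.0; dt = 1.0e-4       # conductivity and time step (illustrative)
    _dx = _dy = _dz = nx - 1.0   # inverse grid spacing on a unit cube
    for it = 1:nt
        @parallel step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)  # parallelized launch
        T, T2 = T2, T                                       # pointer swap
    end
    return T
end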

\begin{figure}[t]
\centerline{\includegraphics[width=8cm]{julia_xpu_Teff.png}}
@@ -89,9 +89,8 @@ \section{Approach}
ParallelStencil is seamlessly interoperable with packages for distributed parallelization, such as \texttt{ImplicitGlobalGrid.jl} \cite{implicitglobalgrid2022} or \texttt{MPI.jl}, to enable high-performance stencil computations on GPU or CPU supercomputers. Communication can be hidden behind computation with a simple macro call \cite{implicitglobalgrid2022}. This feature solely requires that communication can be triggered explicitly, as is possible with, e.g., ImplicitGlobalGrid and \texttt{MPI.jl}.
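
As a minimal sketch of how these pieces combine (assuming the ParallelStencil initialization and \texttt{step!} kernel from above; the halo-width tuple passed to \texttt{@hide\_communication} and the helper name are illustrative), hiding communication behind computation can look as follows:

using ImplicitGlobalGrid

function run_distributed(nx, ny, nz, nt)
    me, dims = init_global_grid(nx, ny, nz)  # Cartesian MPI topology; nx, ny, nz are local grid sizes
    T   = @rand(nx, ny, nz);  T2 = copy(T)
    Ci  = @ones(nx, ny, nz)
    lam = 1.0; dt = 1.0e-4
    _dx = _dy = _dz = nx - 1.0
    for it = 1:nt
        @hide_communication (8, 8, 8) begin  # overlap halo exchange with inner-point computation
            @parallel step!(T2, T, Ci, lam, dt, _dx, _dy, _dz)
            update_halo!(T2)                 # explicitly triggered boundary exchange
        end
        T, T2 = T2, T
    end
    finalize_global_grid()
    return T
end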

\section{Results}
We here report the performance achieved on different architectures with the 3-D heat diffusion xPU solver (Fig.~\ref{fig:code}) and of an equivalent solver with explicit notation for the stencil computations and compare it to the performance obtained with a Julia solver written in a traditional way using GPU or CPU array broadcasting. We observe that using ParallelStencil we achieve an effective memory throughput, $T_\mathrm{eff}$, of 496~GB/s and 1262~GB/s on the Nvidia P100 and A100 GPUs, which can reach a peak throughput, $T_\mathrm{peak}$, of 561~GB/s and 1355~GB/s, respectively; this means we reach 88\% and 93\% of the respective hardware's theoretical performance upper bound ($T_\mathrm{eff}$ and its interpretation are explained, e.g., in \cite{rass2022assessing}). Furthermore, using ParallelStencil we obtain a speedup of up to a factor $\approx 5$ and $\approx 29$ over the versions with GPU and CPU array broadcasting (the latter is not capable of multi-threading), respectively.
Moreover, we have translated solvers for highly nonlinear 3-D poro-visco-elastic two-phase flow and 3-D reactive porosity waves written in CUDA C using MPI to Julia by employing ParallelStencil (and ImplicitGlobalGrid for the distributed parallelization) and compared obtained performance. The translated solvers achieved 90\% and 98\% of the performance of the respective original CUDA C solvers. In addition, relying on ParallelStencil's feature to hide communication behind computation, the 3-D poro-visco-elastic two-phase flow solver achieved over 95\% parallel efficiency on up to 1024 GPUs \cite{implicitglobalgrid2022}.
We report here the performance achieved on different architectures with the 3-D heat diffusion xPU solver (Fig.~\ref{fig:code}) and with an equivalent solver using explicit notation for the stencil computations, and compare it to the performance obtained with a Julia solver written in a traditional way using GPU or CPU array broadcasting. We observe that using ParallelStencil we achieve an effective memory throughput, $T_\mathrm{eff}$, of 496~GB/s and 1262~GB/s on the Nvidia P100 and A100 GPUs, which can reach a peak throughput, $T_\mathrm{peak}$, of 559~GB/s and 1370~GB/s, respectively \cite{deakin2020}; this means we reach 89\% and 92\% of the respective hardware's theoretical performance upper bound ($T_\mathrm{eff}$ and its interpretation are explained, e.g., in \cite{rass2022assessing}). Furthermore, using ParallelStencil we obtain speedups of up to a factor of $\approx 5$ and $\approx 29$ over the versions with GPU and CPU array broadcasting (the latter is not capable of multi-threading), respectively.
Moreover, we have translated solvers for highly nonlinear 3-D poro-visco-elastic two-phase flow and 3-D reactive porosity waves, originally written in CUDA C with MPI, to Julia by employing ParallelStencil (and ImplicitGlobalGrid for the distributed parallelization) and compared the obtained performance. The translated solvers achieved 90\% and 98\% of the performance of the respective original CUDA C solvers. In addition, relying on ParallelStencil's feature to hide communication behind computation, the 3-D poro-visco-elastic two-phase flow solver achieved over 95\% parallel efficiency on up to 1024 GPUs \cite{implicitglobalgrid2022}.
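
Schematically, following \cite{rass2022assessing}, the metric is
\[
T_\mathrm{eff} = \frac{A_\mathrm{eff}}{t_\mathrm{it}}, \qquad
A_\mathrm{eff} = (2\,D_\mathrm{u} + D_\mathrm{k})\, n_x n_y n_z\, s_\mathrm{p},
\]
where $t_\mathrm{it}$ is the time per iteration, $D_\mathrm{u}$ the number of fields that must be read and written in every iteration, $D_\mathrm{k}$ the number of fields that only need to be read, and $s_\mathrm{p}$ the precision in bytes; for the heat diffusion solver, e.g., $D_\mathrm{u}=1$ (temperature) and $D_\mathrm{k}=1$ (heat capacity), the field counting given here being illustrative.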

\section{Conclusions}
We have shown that ParallelStencil enables scalable performance, performance portability and productivity, and thus responds to the challenge of addressing the three ``P''s in all their aspects. Moreover, we have outlined the effectiveness and wide applicability of our approach within geosciences. Our approach is naturally not limited to geosciences, as stencil computations are commonly used in many disciplines across all of science. We illustrated this in recent contributions, where we showcased a computational cognitive neuroscience application modelling visual target selection using ParallelStencil and \texttt{MPI.jl} \cite{pasc22}, and a quantum fluid dynamics solver for the nonlinear Gross-Pitaevskii equation implemented with ParallelStencil (and ImplicitGlobalGrid) \cite{pasc21}.
49 changes: 32 additions & 17 deletions paper/ref.bib
@@ -11,13 +11,13 @@ @article{bezanson2017julia
}

@INPROCEEDINGS{wse_stencil,
author={Rocki, Kamil and Essendelft, Dirk Van and Sharapov, Ilya and Schreiber, Robert and Morrison, Michael and Kibardin, Vladimir and Portnoy, Andrey and Dietiker, Jean Francois and Syamlal, Madhava and James, Michael},
booktitle={SC20: International Conference for High Performance Computing, Networking, Storage and Analysis},
title={Fast Stencil-Code Computation on a Wafer-Scale Processor},
year={2020},
volume={},
number={},
pages={1-14},
doi={10.1109/SC41405.2020.00062}
}

@@ -62,15 +62,15 @@ @online{juliacon2020scaling
}

@Article{rass2022assessing,
AUTHOR = {R\"ass, L. and Utkin, I. and Duretz, T. and Omlin, S. and Podladchikov, Y. Y.},
TITLE = {Assessing the robustness and scalability of the accelerated pseudo-transient method},
JOURNAL = {Geoscientific Model Development},
VOLUME = {15},
YEAR = {2022},
NUMBER = {14},
PAGES = {5757--5786},
URL = {https://gmd.copernicus.org/articles/15/5757/2022/},
DOI = {10.5194/gmd-15-5757-2022}
}

@online{pasc21,
@@ -87,4 +87,19 @@ @online{pasc22
howpublished={PASC22 conference},
location={Basel},
year={2022},
}

@inproceedings{deakin2020,
address = {GA, USA},
title = {Tracking {Performance} {Portability} on the {Yellow} {Brick} {Road} to {Exascale}},
isbn = {978-1-66542-287-1},
url = {https://ieeexplore.ieee.org/document/9309052/},
doi = {10.1109/P3HPC51967.2020.00006},
urldate = {2021-10-14},
booktitle = {2020 {IEEE}/{ACM} {International} {Workshop} on {Performance}, {Portability} and {Productivity} in {HPC} ({P3HPC})},
publisher = {IEEE},
author = {Deakin, Tom and Poenaru, Andrei and Lin, Tom and McIntosh-Smith, Simon},
month = nov,
year = {2020},
pages = {1--13},
}
