Adding more scaling results.
Johannes Markert committed Nov 11, 2024
1 parent 8b5757a commit 22f86c2
Showing 6 changed files with 77 additions and 12 deletions.
8 changes: 8 additions & 0 deletions paper.bib
@@ -608,3 +608,11 @@ @article{geuzaine2009gmsh
year={2009},
publisher={Wiley Online Library}
}

@misc{terrabyte.lrz.de,
title = {terrabyte supercomputer},
author = {Leibniz Supercomputing Centre},
url = {https://docs.terrabyte.lrz.de},
urldate = {2024-11-11},
publisher = {LRZ},
}
46 changes: 34 additions & 12 deletions paper.md
@@ -201,18 +201,30 @@ on the JUQUEEN and the JUWELS supercomputers at the Jülich Supercomputing
Center. In \autoref{tab:t8code_runtimes} [@holke_optimized_2021], we show that
`t8code`'s ghost routine is exceptionally fast, with proper scaling up to 1.1
trillion mesh elements. Computing ghost layers around parallel domains is
usually the most expensive of all mesh operations. To put these results into
perspective, we conducted scaling tests on the terrabyte cluster
[@terrabyte.lrz.de] at the Leibniz Supercomputing Centre, comparing the ghost
layer creation runtimes of `p4est` and `t8code`; see
\autoref{fig:ghost_layer_runtimes} for the results. The `p4est` library is
established as one of the most performant meshing libraries
[@BursteddeWilcoxGhattas11], specializing in adaptive quadrilateral and
hexahedral meshes. `t8code` shows near-perfect scaling for tetrahedral meshes,
on par with `p4est`. Measured per ghost element, the absolute runtime of
`t8code` is up to roughly 1.5 times that of `p4est`. This is expected, since
`t8code`'s ghost layer algorithm is more complex and thus somewhat less
optimized, owing to its support for a wide range of element types.
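
The per-ghost-element normalization used in \autoref{fig:ghost_layer_runtimes}
can be recomputed from the committed data files. A minimal sketch in Python,
assuming (based on the gnuplot script and the figure caption) that the columns
are MPI ranks, ghost layer runtime in seconds, and average ghosts per rank:

```python
# Sketch: recompute per-ghost-element runtimes and the t8code/p4est runtime
# ratio from the committed data files. The column meaning is an assumption
# inferred from plot-timings-per-num-ghosts.gpi: MPI ranks, runtime [s],
# average number of ghost elements per rank.

def load(path):
    rows = []
    with open(path) as f:
        for line in f:
            ranks, runtime, ghosts = line.split()
            rows.append((int(ranks), float(runtime), int(ghosts)))
    return rows

p4 = load("pics/timings-p4.dat")
t8 = load("pics/timings-t8-tet.dat")

for (ranks, t_p4, ghosts), (_, t_t8, _) in zip(p4, t8):
    us_p4 = t_p4 / ghosts * 1e6  # microseconds per ghost element
    us_t8 = t_t8 / ghosts * 1e6
    print(f"{ranks:5d} ranks: p4est {us_p4:.2f} us/ghost, "
          f"t8code {us_t8:.2f} us/ghost, ratio {t_t8 / t_p4:.2f}")
```

For the committed data this yields ratios between about 1.1 (80 ranks) and
1.5 (320 ranks).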

Furthermore, in a prototype code [@Dreyer2021] implementing a high-order
discontinuous Galerkin (DG) method for advection-diffusion equations on
dynamically adaptive hexahedral meshes, we report a twelvefold speed-up
compared to non-AMR meshes, with `t8code` contributing only around 15\% of the
overall runtime. In \autoref{fig:t8code_runtimes} we compare, over the number
of processes, the runtimes of the DG solver and the summed mesh operations
performed by `t8code`: ghost computation, ghost data exchange, partitioning
(load balancing), refinement and coarsening, as well as balancing, which
ensures a difference of at most one refinement level between face-neighboring
elements. The graphs in \autoref{fig:t8code_runtimes} clearly show that
`t8code` accounts for only around 15\% to 20\% of the overall runtime relative
to the solver.
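
To make the list of measured operations concrete, the following is a purely
illustrative sketch of one adaptive time step. All helper names are
hypothetical no-op placeholders, not `t8code`'s actual API, and the ordering
is one typical arrangement rather than the prototype's exact sequence:

```python
# Hypothetical placeholders for the mesh operations summed in the
# measurements; the bodies are no-ops so the sketch stays self-contained.
def refine_and_coarsen(forest): return forest  # adapt via an error indicator
def balance(forest): return forest             # limit face neighbors to one
                                               # refinement level difference
def partition(forest): return forest           # redistribute for load balance
def compute_ghost_layer(forest): return []     # collect remote face neighbors
def exchange_ghost_data(forest, ghosts): pass  # sync solution data on ghosts
def dg_solver_step(forest): pass               # the dominant cost (~80-85%)

def adaptive_time_step(forest):
    forest = refine_and_coarsen(forest)
    forest = balance(forest)
    forest = partition(forest)
    ghosts = compute_ghost_layer(forest)
    exchange_ghost_data(forest, ghosts)
    dg_solver_step(forest)
    return forest
```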

+----------------+-------------------+--------------------+--------+
| \# Process | \# Elements | \# Elem. / process | Ghost |
@@ -226,6 +238,16 @@ to the solver.
| elements. \label{tab:t8code_runtimes} |
+================+===================+====================+========+

![Runtimes of ghost layer creation on the terrabyte cluster
[@terrabyte.lrz.de] for `p4est` and `t8code`. The meshes have been refined
into a Menger sponge for the hexahedral mesh with `p4est` (max. level 12) and
a Sierpinski sponge for the tetrahedral mesh with `t8code` (max. level 13),
creating a fractal pattern with billions of elements as a stress test. To
make the two runs comparable, the runtimes have been divided by the average
local number of ghost elements on an MPI rank.
\label{fig:ghost_layer_runtimes}
](pics/plot-timings-per-num-ghosts.png){width="90%"}

![Runtimes on JUQUEEN of the solver and summed mesh operations of our DG
prototype code coupled with `t8code`. Mesh operations are ghost computation,
ghost data exchange, partitioning (load balancing), refinement and coarsening
25 changes: 25 additions & 0 deletions pics/plot-timings-per-num-ghosts.gpi
@@ -0,0 +1,25 @@
# Gnuplot script for the ghost layer scaling plot.
# Regenerate with: gnuplot plot-timings-per-num-ghosts.gpi
set terminal pngcairo enhanced

set output "plot-timings-per-num-ghosts.png"

set encoding utf8

set grid

# Data columns: MPI ranks, ghost layer runtime [s], avg. ghosts per rank.
p4hex = "timings-p4.dat"
t8tet = "timings-t8-tet.dat"

set xlabel "number of MPI ranks"
set ylabel "ghost layer runtime over #ghosts [μs/#ghosts]"

set logscale x 2

set yrange [0:7]

set key bottom

set title "Runtimes of ghost layer creation per ghost element over num. of proc."

# Convert seconds per ghost element to microseconds (factor 1e6).
plot \
p4hex using 1:($2/$3 * 1e6) with lp lw 2 ps 2 title "p4est with hexahedral mesh (218 billion elements)", \
t8tet using 1:($2/$3 * 1e6) with lp lw 2 ps 2 title "t8code with tetrahedral mesh (93 billion elements)"
Binary file added pics/plot-timings-per-num-ghosts.png
5 changes: 5 additions & 0 deletions pics/timings-p4.dat
@@ -0,0 +1,5 @@
80 2.7853 620022
160 1.47469 295267
320 0.795606 206677
640 0.431872 98434
1280 0.246186 68897
5 changes: 5 additions & 0 deletions pics/timings-t8-tet.dat
@@ -0,0 +1,5 @@
80 2.9717 620022
160 1.79744 295267
320 1.17688 206677
640 0.591025 98434
1280 0.332564 68897
