
SWE performance exercise


Prepare your Git repository

Start by cloning a clean version of the master branch of the SWE.git repository:

git clone --recursive https://github.com/fomics/SWE.git

... and apply a patch to the SConstruct file:

cd SWE
git cherry-pick 6624e61b9b82a8e6098bcf556d187a7fb9d7f492

Compile the SWE MPI/OpenMP hybrid version using the GNU compiler (via the Cray compiler wrapper)

module swap PrgEnv-cray PrgEnv-gnu
module load scons
module load python/2.7.2

Edit the SConstruct file to add the -fopenmp option required for OpenMP compilation.

Before:

# OpenMP parallelism?
if env['compiler'] == 'intel' and env['openmp']:
  env.Append(CCFLAGS=['-openmp'])
  env.Append(LINKFLAGS=['-openmp'])

After:

# OpenMP parallelism?
if env['compiler'] == 'intel' and env['openmp']:
  env.Append(CCFLAGS=['-openmp'])
  env.Append(LINKFLAGS=['-openmp'])
if env['compiler'] == 'cray' and env['openmp']:
  env.Append(CCFLAGS=['-fopenmp'])
  env.Append(LINKFLAGS=['-fopenmp'])

Now compile the MPI/OpenMP hybrid version with:

scons copyenv=true compiler=cray parallelization=mpi solver=fwavevec openmp=yes
  • Open the file src/blocks/SWE_WavePropagation.cpp
    • Add the line #define LOOP_OPENMP before the block

      #ifdef LOOP_OPENMP
      #include <omp.h>
      #endif

    • Comment out the line that starts with solver::Hybrid<float>, which is only needed for the hybrid Riemann solver (see the sketch after this list).
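
For orientation, after these two edits the top of SWE_WavePropagation.cpp should look roughly like the sketch below. The exact form of the commented-out solver::Hybrid<float> line in your checkout may differ, so treat this as a schematic rather than the literal file contents.

#define LOOP_OPENMP            // added: enables the OpenMP code paths guarded by LOOP_OPENMP

#ifdef LOOP_OPENMP
#include <omp.h>
#endif

// ...

// solver::Hybrid<float> ...;  // commented out: only needed for the hybrid Riemann solver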

Perform OpenMP parallelization of the initialization

cd src/tools

Edit the file help.hh. Comment out lines 99 and 100, the zero-initialization of the Float2D elements. (This removes a serial initialization loop, so the arrays are first written inside the OpenMP-parallel initialization loops added below.) The resulting constructor should look like this:

       Float2D(int _cols, int _rows) : rows(_rows), cols(_cols)
       {
              elem = new float[rows*cols];
              // for (int i = 0; i < rows*cols; i++)
              //      elem[i] = 0;
       }

Next, parallelize the initialization with OpenMP.

cd src/blocks

Edit the file SWE_Block.cpp. Make the for loops at lines 97 and 107 parallel, namely the scenario initialization of the water heights and of the bathymetry. The resulting code should look like this:

  // initialize water height and discharge
#pragma omp parallel for
  for(int i=1; i<=nx; i++)
    for(int j=1; j<=ny; j++) {
      float x = offsetX + (i-0.5f)*dx;
      float y = offsetY + (j-0.5f)*dy;
      h[i][j] = i_scenario.getWaterHeight(x,y);
      hu[i][j] = i_scenario.getVeloc_u(x,y) * h[i][j];
      hv[i][j] = i_scenario.getVeloc_v(x,y) * h[i][j];
    }

  // initialize bathymetry
#pragma omp parallel for
  for(int i=0; i<=nx+1; i++) {
    for(int j=0; j<=ny+1; j++) {
      b[i][j] = i_scenario.getBathymetry( offsetX + (i-0.5f)*dx,
                                          offsetY + (j-0.5f)*dy );
    }
  }

This is essentially the result of Michael Bader's component on SWE (without the vectorization aspects, which were specific to the Intel compiler). Before anything else, try this with various numbers of MPI processes and OpenMP threads on one node to make sure that you see scalability, e.g., from the SWE directory:

salloc -N 1   # get an allocation of one node
OMP_NUM_THREADS=1 aprun -n 1 -d 1 build/SWE_cray_release_mpi_fwavevec -x 400 -y 400 -c 1 -o /dev/null
OMP_NUM_THREADS=16 aprun -n 1 -d 16 build/SWE_cray_release_mpi_fwavevec -x 400 -y 400 -c 1 -o /dev/null
OMP_NUM_THREADS=4 aprun -n 1 -d 4 build/SWE_cray_release_mpi_fwavevec -x 400 -y 400 -c 1 -o /dev/null
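
If the timings barely change with OMP_NUM_THREADS, it can help to first confirm that the OpenMP runtime is picking up the thread count at all. A minimal standalone check (illustrative only, not part of the exercise) could look like this:

// omp_check.cpp -- tiny sanity check that OMP_NUM_THREADS is honoured
// (hypothetical helper, not part of the SWE sources).
#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel
  {
    // Only one thread reports, to keep the output to a single line.
    #pragma omp master
    std::printf("OpenMP is active with %d thread(s)\n", omp_get_num_threads());
  }
  return 0;
}

Built with, e.g., CC -fopenmp omp_check.cpp -o omp_check and launched through aprun, it should report the same thread count you pass via OMP_NUM_THREADS.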

If time permits, try more than one node with a much larger problem (say 16x bigger), e.g.,

salloc -N 4   # get an allocation of four nodes
OMP_NUM_THREADS=16 aprun -N 1 -n 4 -d 16 build/SWE_cray_release_mpi_fwavevec -x 1600 -y 1600 -c 1 -o /dev/null