cuda aware run of deltawing case fails on perlmutter #89

Open
cwsmith opened this issue Mar 6, 2024 · 0 comments

environment

$ module li

Currently Loaded Modules:
  1) craype-x86-milan     3) craype-network-ofi                      5) PrgEnv-gnu/8.5.0   7) cray-libsci/23.12.5   9) craype/2.7.30    11) perftools-base/23.12.0  13) cudatoolkit/12.2       15) gpu/1.0
  2) libfabric/1.15.2.0   4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   6) cray-dsmml/0.2.2   8) cray-mpich/8.1.28    10) gcc-native/12.3  12) cpe/23.12               14) craype-accel-nvidia80

versions

  • Omega_h: scorec/omega_h master @ 7a39707
  • Kokkos: kokkos/kokkos master @ e0dc0128e

build

$ cat doConfigPerlKk.sh 
bdir=$PWD/build-kokkos
cmake -S kokkos -B $bdir \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=ON \
  -DCRAYPE_LINK_TYPE=dynamic \
  -DCMAKE_CXX_COMPILER=$PWD/kokkos/bin/nvcc_wrapper \
  -DKokkos_ARCH_AMPERE80=ON \
  -DKokkos_ENABLE_SERIAL=ON \
  -DKokkos_ENABLE_OPENMP=off \
  -DKokkos_ENABLE_CUDA=on \
  -DKokkos_ENABLE_CUDA_LAMBDA=on \
  -DKokkos_ENABLE_DEBUG=off \
  -DCMAKE_INSTALL_PREFIX=$bdir/install
$ cat doConfigPerlOmegah.sh 
#!/bin/bash -ex

usage="Usage: $0  <mpi=on|off> <cudaAware=on|off>"
[[ $# -ne 2 ]] && echo $usage && exit 1

mpi=$1
[[ $mpi != "on" && $mpi != "off" ]] && echo $usage && exit 1

cudaAware=$2
[[ $cudaAware != "on" && $cudaAware != "off" ]] && echo $usage && exit 1

bdir=$PWD/build-omegah-mpi${mpi}-cudaAware${cudaAware}
cmake -S omega_h -B $bdir \
  -DCMAKE_INSTALL_PREFIX=$bdir/install \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=on \
  -DOmega_h_USE_Kokkos=on \
  -DOmega_h_CUDA_ARCH=80 \
  -DOmega_h_USE_MPI=$mpi \
  -DOmega_h_USE_CUDA_AWARE_MPI=$cudaAware \
  -DBUILD_TESTING=on \
  -DCMAKE_CXX_COMPILER=CC
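
Since the two Omega_h builds are meant to differ only in `Omega_h_USE_CUDA_AWARE_MPI`, it may help to confirm what each build directory actually recorded before comparing runs. The following is a minimal sketch, assuming the usual `NAME:TYPE=VALUE` entry layout of `CMakeCache.txt`; the `cache_value` helper name is hypothetical and is not part of Omega_h or CMake.

```shell
#!/bin/bash
# Hypothetical helper: print the value CMake cached for one variable.
# Assumes the standard CMakeCache.txt entry layout, NAME:TYPE=VALUE.
cache_value() {
  local cache_file="$1"
  local name="$2"
  # Strip the "NAME:TYPE=" prefix and print what remains of a matching line.
  sed -n "s/^${name}:[^=]*=//p" "$cache_file"
}

# Example (build directory name as produced by doConfigPerlOmegah.sh on on):
#   cache_value build-omegah-mpion-cudaAwareon/CMakeCache.txt Omega_h_USE_CUDA_AWARE_MPI
```

If the variable is absent from the cache, the helper prints nothing, which would itself suggest the option never reached the configure step.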

run

Download the Omega_h delta wing meshes: https://zenodo.org/records/10672130

$ cat submitP2.sh
sbatch --nodes 1 --qos regular --time 00:10:00 --constraint gpu --gpus 4 --account=PROJECT_NAME ./runP2.sh
$ cat runP2.sh
#!/bin/bash
bin_cudaAwareOff=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/build-omegah-mpion-cudaAwareoff/src
bin_cudaAwareOn=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/build-omegah-mpion-cudaAwareon/src
mesh=/pscratch/sd/c/cwsmith/omegahDeltaWingAdapt/twoGpus/deltaWing_500kMetric_p2.osh

cmd="$bin_cudaAwareOff/ugawg_hsc_oshmeshload --osh-pool $mesh"
export MPICH_GPU_SUPPORT_ENABLED=0
set -x
srun -n 2 $cmd &> log2p_cudaAwareOff
set +x

cmd="$bin_cudaAwareOn/ugawg_hsc_oshmeshload --osh-pool $mesh"
export MPICH_GPU_SUPPORT_ENABLED=1
set -x
srun -n 2 $cmd &> log2p_cudaAwareOn
set +x
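
The two runs in `runP2.sh` differ only in the build directory and the value of `MPICH_GPU_SUPPORT_ENABLED`, so they could be folded into one function that also records each exit status. This is only a sketch of that idea: `run_case` and the `LAUNCHER` override are hypothetical names introduced here so the function can be dry-run without Slurm, and the default launcher simply mirrors the `srun -n 2` used above.

```shell
#!/bin/bash
# Sketch: run one configuration, write its log, and report the exit status.
# LAUNCHER is a hypothetical override; it defaults to the srun call in runP2.sh.
run_case() {
  local label="$1"        # suffix for the log file, e.g. cudaAwareOn
  local gpu_support="$2"  # value for MPICH_GPU_SUPPORT_ENABLED (0 or 1)
  local bin_dir="$3"      # directory containing ugawg_hsc_oshmeshload
  local mesh="$4"         # path to the .osh mesh
  local launcher="${LAUNCHER:-srun -n 2}"
  export MPICH_GPU_SUPPORT_ENABLED="$gpu_support"
  # $launcher is intentionally unquoted so "srun -n 2" splits into words.
  $launcher "$bin_dir/ugawg_hsc_oshmeshload" --osh-pool "$mesh" > "log2p_$label" 2>&1
  local status=$?
  echo "$label: exit status $status"
  return $status
}

# Example, reusing the variables defined in runP2.sh:
#   run_case cudaAwareOff 0 "$bin_cudaAwareOff" "$mesh"
#   run_case cudaAwareOn  1 "$bin_cudaAwareOn"  "$mesh"
```

Printing the status per case makes it easier to see at a glance which configuration aborted without opening both logs.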

error

$ cat log2p_cudaAwareOn
(GTL DEBUG: 0) cuIpcGetMemHandle: invalid argument, CUDA_ERROR_INVALID_VALUE, line no 148
MPICH ERROR [Rank 0] [job id 22622708.1] [Wed Mar  6 07:48:56 2024] [nid002241] - Abort(606713346) (rank 0 in comm 0): Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x623196f88, count=2382, MPI_INT, dest=1, tag=42, comm=0xc4000000, request=0x23c3f34) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count

aborting job:
Fatal error in PMPI_Isend: Invalid count, error stack:
PMPI_Isend(161)......................: MPI_Isend(buf=0x623196f88, count=2382, MPI_INT, dest=1, tag=42, comm=0xc4000000, request=0x23c3f34) failed
MPID_Isend(584)......................: 
MPIDI_isend_unsafe(136)..............: 
MPIDI_SHM_mpi_isend(323).............: 
MPIDI_CRAY_Common_lmt_isend(84)......: 
MPIDI_CRAY_Common_lmt_export_mem(103): 
(unknown)(): Invalid count
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
srun: error: nid002241: task 0: Exited with exit code 255
srun: Terminating StepId=22622708.1
slurmstepd: error: *** STEP 22622708.1 ON nid002241 CANCELLED AT 2024-03-06T15:48:58 ***
srun: error: nid002241: task 1: Terminated
srun: Force Terminated StepId=22622708.1
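
The log repeats the same MPICH error stack twice, so the two lines that matter (the GTL diagnostic and the failing MPI call) are easy to lose. A small sketch for pulling them out of a log file follows; `summarize_mpi_log` is a hypothetical helper name, and the patterns are assumptions based only on the message formats visible in the log above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Omega_h or Cray MPICH): print the first
# GTL diagnostic and the first MPICH fatal-error summary found in a log.
summarize_mpi_log() {
  local log="$1"
  # First line emitted with the "(GTL DEBUG" prefix, if any.
  grep -m1 -F '(GTL DEBUG' "$log"
  # First "Fatal error in <call>: <reason>" fragment, cut at the first comma.
  grep -m1 -o 'Fatal error in [A-Za-z_]*: [^,]*' "$log"
  return 0
}

# Example:
#   summarize_mpi_log log2p_cudaAwareOn
```

On the log above this should print the `cuIpcGetMemHandle` line followed by `Fatal error in PMPI_Isend: Invalid count`.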