Crash with large 3D simulations on LUMI #4236

Open
tmsclark2 opened this issue Aug 25, 2023 · 3 comments
Labels
- backend: hip - Specific to ROCm execution (GPUs)
- bug - Something isn't working
- machine / system - Machine or system-specific issue

Comments

tmsclark2 commented Aug 25, 2023

Hi,
I get crashes with large 3D simulations on LUMI. The crash concerns an allgather routine during MPI initialization (PMI_Allgather in the error stack):

MPICH ERROR [Rank 0] [job id 4292261.0] [Thu Aug  3 00:01:58 2023] [nid006593] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(805):
MPIDU_bc_table_create(204)..:  PMI_Allgather failed: -1

This crash happens before WarpX starts and does not produce traces.

Here are the error output of the simulation and the submit file: warpx-4292261.txt batch.txt

Here are the modules used for the compilation: Recipe_warpx.txt

ax3l added the labels bug, backend: hip, and machine / system on Aug 25, 2023
ax3l commented Aug 25, 2023

ax3l commented Aug 25, 2023

@tmsclark2 can you please open a support ticket with LUMI about this, pointing them to the OLCF issue and asking what they recommend doing on LUMI?
Please CC the people on this issue in your ticket 🙏

ax3l commented Aug 25, 2023

@tmsclark2 can you try this workaround on LUMI?
#4237


6 participants