Replies: 2 comments
-
I am guessing I can also try to demonstrate this with CkDieNow().
-
Okay, I have also tried failure injection with a kill file, which correctly kills the PE specified in the file, but I get the same behavior as before, since MPI kills the whole program. Hoping that with spare processors my program would never reach the point of an MPI abort, I also tried adding them with the +wp command-line option, which apparently also works:
But I guess that, since I'm running with MPI, this will not work either: underneath charmrun, mpirun is executed with -np 4, which gives me 4 MPI ranks, but CkNumPes() in a group then returns only 3. This causes a deadlock because, I guess, one of my MPI processes never returns from an MPI collective. I also see in the manual that fault tolerance only works on the TCP-based net layers. I'm guessing this means it will not work with MPI, correct? If so, what options do I have for fault tolerance (other than getting rid of my MPI libraries)?
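To make the suspected deadlock concrete, here is a toy model in plain C++. It makes no MPI or Charm++ calls, and the names ToyCollective and collectiveCompletes are made up for illustration. It only encodes my assumption that a collective over a communicator finishes once every rank in that communicator has entered it, so if the rank backing the spare PE never enters, the other ranks wait forever.

```cpp
#include <cassert>
#include <vector>

// Toy model only: no MPI or Charm++ calls are made here.
// Assumption: a collective over `commSize` ranks completes only after
// every one of those ranks has entered it.
class ToyCollective {
public:
    explicit ToyCollective(int commSize) : entered_(commSize, false) {}

    // Record that `rank` has entered the collective.
    void enter(int rank) {
        assert(rank >= 0 && rank < static_cast<int>(entered_.size()));
        entered_[rank] = true;
    }

    // True only if all ranks of the communicator have entered.
    bool completes() const {
        for (bool e : entered_) {
            if (!e) return false;
        }
        return true;
    }

private:
    std::vector<bool> entered_;
};

// Hypothetical helper: does the collective finish when only the first
// `callers` ranks (out of `commSize`) enter it?
bool collectiveCompletes(int commSize, int callers) {
    assert(callers >= 0 && callers <= commSize);
    ToyCollective c(commSize);
    for (int r = 0; r < callers; ++r) {
        c.enter(r);
    }
    return c.completes();
}
```

With the numbers from my run, collectiveCompletes(4, 3) is false (4 MPI ranks, but only the 3 PEs reported by CkNumPes() take part), while collectiveCompletes(4, 4) is true. Whether the spare rank really skips the collective in my setup is still a guess.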
-
Hi folks,
I wonder if in-memory checkpointing works with the MPI layer.
I successfully hooked up
CkStartMemCheckpoint()
and my app runs correctly, displaying …, but when I try to test it by killing one of the processes, MPI aborts the job with …
I tried setting a different OMPI error handler from a chare group constructor, and I also tried mpi_abort_delay via both environment variables and the mpirun command line, but none of these seems to have any effect.
Can this be done with MPI + Charm? Or should I try building my MPI-only libraries with AMPI?
Thanks,
Jozsef