Hi folks,
I have checkpoint/restart working fine when the same number of PEs is used for checkpoint and restart. Now I would like to checkpoint with 2 PEs and restart with only 1 (or 3), but I run into a segfault. This is the best trace I've been able to get using valgrind's memcheck:
[0]CkRestartMain done. sending out callback.
==631045== Invalid read of size 8
==631045== at 0xB6FBA6: _processForPlainChareMsg(CkCoreState*, envelope*) (charm/src/ck-core/ck.C:966)
==631045== by 0xB6F205: _processHandler(void*, CkCoreState*) (charm/src/ck-core/ck.C:1287)
==631045== by 0xC7A516: CmiHandleMessage (charm/src/conv-core/convcore.C:1696)
==631045== by 0xC7A80D: CsdScheduleForever (charm/src/conv-core/convcore.C:1943)
==631045== by 0xC7A569: CsdScheduler (charm/src/conv-core/convcore.C:1882)
==631045== by 0xCC7B0D: ConverseRunPE(int) (charm/src/arch/util/machine-common-core.C:1614)
==631045== by 0xCC7378: ConverseInit (charm/src/arch/util/machine-common-core.C:1529)
==631045== by 0xC61E2E: charm_main (charm/src/ck-core/init.C:1756)
==631045== by 0xA8E051: main (charm/src/ck-core/main.C:5)
==631045== Address 0x8 is not stack'd, malloc'd or (recently) free'd
==631045==
Caught signal 11 (SIGSEGV)
>>> Exception: Signal caught
>>>
>>> =========== CALL TRACE ===========
>>>
>>> /lib/x86_64-linux-gnu/libc.so.6 : ()+0x38d60
>>> Main/inciter() [0xb6fba6]
>>> Main/inciter : _processHandler(void*, CkCoreState*)+0x2a6
>>> Main/inciter : CmiHandleMessage()+0x97
>>> Main/inciter : CsdScheduleForever()+0xce
>>> Main/inciter : CsdScheduler()+0x1a
>>> Main/inciter() [0xcc7b0e]
>>> Main/inciter : ConverseInit()+0x709
>>> Main/inciter : charm_main()+0x3f
>>> Main/inciter : main()+0x22
>>> /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0xea
>>> Main/inciter : _start()+0x2a
>>>
>>> ======= END OF CALL TRACE ========
>>>
[Partition 0][Node 0] End of program
This happens after the migrate constructors and pup routines of all chare arrays, a group, and a node group have been called successfully, and apparently before the resume callback, passed to CkStartCheckpoint() at checkpoint time, is called. I am running git tag v7.0.0-rc2, built with cmake -DTARGET=LIBS -DNETWORK=mpi -DSMP=OFF -DENABLE_FORTRAN=off -DCMAKE_BUILD_TYPE=Debug on Linux.
I suspect I am missing something pretty obvious, since I'm not very familiar with this kind of checkpoint/restart and/or shrink/expand. I wonder if anyone has any suggestions on how I can continue debugging this.
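As a possible starting point for debugging, the `Address 0x8` line in the memcheck output is what a member read through a null pointer typically looks like: the faulting address equals the member's offset from a base of 0. The sketch below only illustrates that arithmetic. `FakeObject`, `member_address`, and `looks_like_null_member_access` are hypothetical names made up for illustration, not Charm++ types or functions, and whether a null object pointer is really what is read at `ck.C:966` is a guess that has not been checked against the source.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for some runtime object; NOT a Charm++ type.
struct FakeObject {
    std::uint64_t first;   // offset 0
    std::uint64_t second;  // offset 8
};

// Address that an access to `base->second` would touch, computed
// arithmetically so that no pointer is ever dereferenced.
inline std::uintptr_t member_address(std::uintptr_t base) {
    return base + offsetof(FakeObject, second);
}

// Heuristic (an assumption, not a rule): a faulting address inside the first
// 4 KiB page is usually a null base pointer plus a small member offset.
inline bool looks_like_null_member_access(std::uintptr_t fault_addr) {
    return fault_addr < 4096;
}
```

With a null base, `member_address(0)` evaluates to 8, matching the address valgrind reports. If that reading is right, the next thing to check would be which pointer is dereferenced in `_processForPlainChareMsg` and why it would be unset after restarting on a different number of PEs.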
Thanks,
Jozsef