
[Feature Request]: Better interaction / documentation with non-exclusive Slurm jobs #26788

Open
psath opened this issue Feb 26, 2025 · 3 comments

psath commented Feb 26, 2025

Summary of Feature

Some of the multi-locale features struggle in heavily contended Slurm environments. I can't comment on all the launchers, but here are a few things I ran into:

  1. The slurm-gasnet_ibv launcher tries to salloc to run the _real jobs. This presumes that a newly enqueued job will start on an interactive timescale, which may not be true on a busy cluster.
  2. The gasnet_ibv and gasnet_ucx launchers will try to srun if the wrapper program is run from within an allocation. That's fine if your salloc used --exclusive, but it may hang with no indication of why if you hold only a non-exclusive portion of a node. (Particularly interactively, as with salloc srun --pty, where any requested resources get assigned to the pty and none remain for Chapel's inner sruns.) The --oversubscribe flag to salloc and the --overlap flag to srun (and the equivalent SLURM_ environment variables) can help here, but they don't seem to be applied auto-magically.
  3. The gasnet_ibv and gasnet_ucx launchers don't seem to consider the current Slurm job's memory limit when deciding on a segment size. Instead, they grab for all the physical memory on the node, which sometimes trips an OOM error in srun but more often causes a silent SIGKILL, even with GASNet tracing enabled.

(1) is reasonably documented already: don't use the slurm-prefixed launchers unless there's a reasonable chance you can hop right onto your node(s), or use CHPL_LAUNCHER_USE_SBATCH to have Chapel generate a batch script for you. (I haven't tried the latter, as I'm mixing Chapel and non-Chapel workloads in the same batch.)
(2) is a stumbling block for folks with less Slurm experience. Slurm's environment variables can get non-exclusive jobs working out of the box today. My preference would be for the Chapel wrapper to apply the overlap flags automatically, but I can see the case that this falls within the user's responsibility, in which case some documentation might save folks like me time.
(3) seems like a bug. The communication layer shouldn't grab for more memory than the Slurm job has available; I don't know offhand whether Chapel or GASNet should enforce that. There is already some reference to GASNET_PHYSMEM_MAX in the InfiniBand documentation, but it doesn't include the notion of having effective access to less than the whole node's RAM. IIRC, when I tried passing it (but not exporting it), it either didn't propagate from myProgram to myProgram_real or otherwise didn't prevent the SIGKILL. However, manually setting GASNET_MAX_SEGSIZE to a value within my Slurm job's allocation did get me running again.
Edit: on a fresh build, GASNET_PHYSMEM_MAX seems sufficient to prevent the SIGKILL.
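Putting the manual workarounds for (2) and (3) together, a shell sketch of what I ended up exporting. The variable names are Slurm's and GASNet's documented knobs, but the specific memory value and the fallback default are illustrative assumptions, not Chapel-blessed settings:

```shell
# (2) Let Chapel's inner sruns share resources already held by the pty;
#     SLURM_OVERLAP=1 is srun's input-environment equivalent of --overlap.
export SLURM_OVERLAP=1

# (3) Cap GASNet's segment at the job's memory rather than the node's RAM.
#     SLURM_MEM_PER_NODE is set (in MB) inside jobs submitted with --mem;
#     the 32768 fallback is only so this snippet runs outside a job.
export GASNET_PHYSMEM_MAX="${SLURM_MEM_PER_NODE:-32768}M"

echo "$SLURM_OVERLAP $GASNET_PHYSMEM_MAX"
```

With these exported in the job's shell, the launcher-generated sruns inherit them without any Chapel-side changes.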

Trying to summarize some live debugging that happened over the last weeks on Gitter. Please correct any misunderstandings or misinterpretations on my part!

Steps to reproduce:
2) Try to launch a multi-locale program (even with -nl 1) within a non-exclusive Slurm interactive job, without oversubscribe or overlap flags. It doesn't seem to matter whether you use ssh, pmi, or mpi as the spawner.
3) Try to run a multi-locale program (even with -nl 1) within a non-exclusive Slurm job where the --mem Slurm flag is some fraction of the node's physical memory. Set the GASNET_VERBOSEENV=1 environment variable and look at the reported value of GASNET_MAX_SEGSIZE.
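A minimal way to set up the check in step 3. GASNET_VERBOSEENV is GASNet's documented switch for printing its effective settings at startup; the commented-out program name is a placeholder:

```shell
# Ask GASNet to print its environment/settings when the program starts,
# then compare the reported segment size against the job's memory limit.
export GASNET_VERBOSEENV=1
echo "job memory (MB): ${SLURM_MEM_PER_NODE:-unset}"
# ./myProg -nl 1   # inspect GASNET_MAX_SEGSIZE in its startup output
```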

psath commented Feb 26, 2025

Allocating an interactive Slurm job with salloc <args> --overcommit srun --overlap --pty bash seems sufficient to avoid doing anything explicit on the Chapel side w.r.t. (2) (i.e., no additional Slurm environment variables needed).

lydia-duncan commented Feb 26, 2025

psath commented Mar 5, 2025

An additional wrinkle that I'm encountering now that I'm extending the above to multi-node:

If you only have a fraction of the CPUs on each of multiple nodes, the CPU binding computed on the launch node may not be valid on the other nodes that myProg_real gets launched on.

For example, I'm trying to run on 16 cores each of two 128-core nodes, launched from the primary of the two (tc-dgx006). The secondary node (tc-dgx007) immediately faults over a bad binding, and the primary node then hangs until the user hits Ctrl+C:

GASNET_PHYSMEM_MAX="$SLURM_MEM_PER_NODE"M ./myProg <myArgs> --numLocales=2
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x000000000000FFFF0000000000000000.
srun: error: Task launch for StepId=2960193.8 failed on node tc-dgx007: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2960193.8 ON tc-dgx006 CANCELLED AT 2025-03-05T11:08:03 ***
srun: error: tc-dgx006: task 0: Killed
^C[mpiexec@tc-dgx006] Sending Ctrl-C to processes as requested
[mpiexec@tc-dgx006] Press Ctrl-C again to force abort
[mpiexec@tc-dgx006] HYDU_sock_write (lib/utils/sock.c:250): write error (Bad file descriptor)
[mpiexec@tc-dgx006] send_hdr_downstream (mpiexec/pmiserv_cb.c:28): sock write error
[mpiexec@tc-dgx006] HYD_pmcd_pmiserv_send_signal (mpiexec/pmiserv_cb.c:218): unable to write data to proxy
[mpiexec@tc-dgx006] ui_cmd_cb (mpiexec/pmiserv_pmci.c:61): unable to send signal downstream
[mpiexec@tc-dgx006] HYDT_dmxu_poll_wait_for_event (lib/tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@tc-dgx006] HYD_pmci_wait_for_completion (mpiexec/pmiserv_pmci.c:173): error waiting for event
[mpiexec@tc-dgx006] main (mpiexec/mpiexec.c:260): process manager error waiting for completion
...
echo $SLURM_CPU_BIND
quiet,mask_cpu:0x00000003F060F1E00000000000000000

The srun error on the secondary node shows its allocated mask as 0x000000000000FFFF0000000000000000, whereas the primary's SLURM_CPU_BIND mask is 0x00000003F060F1E00000000000000000; the binding requested from the primary is not a subset of the secondary's allocation, so srun rejects it.
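To confirm the masks really conflict, a quick bash check (the masks are copied verbatim from the output above) that the primary's requested binding sets bits outside the secondary's allocation, nibble by nibble:

```shell
# Any bit set in the requested binding but clear in the allocated mask
# means srun must reject the CPU bind request on that node.
requested=00000003F060F1E00000000000000000   # SLURM_CPU_BIND from tc-dgx006
allocated=000000000000FFFF0000000000000000   # allocation reported on tc-dgx007
outside=0
for ((i = 0; i < ${#requested}; i++)); do
  r=${requested:i:1}
  a=${allocated:i:1}
  # OR in any requested bits missing from the allocation (per hex digit)
  (( outside |= (16#$r & ~16#$a & 16#F) ))
done
echo "requested bits outside allocation: $((outside != 0))"
```

This prints 1 for the masks above, matching the "CPU binding outside of job step allocation" error.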

I'm able to work around it by setting GASNET_IBV_SPAWNER=ssh, which is less Slurm-aware, but it'd be nice not to need to. Both the pmi and mpi spawners give the behavior above.
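For reference, the workaround as I'm running it. GASNET_IBV_SPAWNER is GASNet's documented spawner selector; the commented-out launch line is a placeholder from earlier in this thread and assumes passwordless ssh between the allocated nodes:

```shell
# Fall back to GASNet's ssh spawner so no srun-imposed CPU binding
# is applied on the remote nodes (assumes node-to-node ssh works).
export GASNET_IBV_SPAWNER=ssh
echo "spawner: $GASNET_IBV_SPAWNER"
# GASNET_PHYSMEM_MAX="${SLURM_MEM_PER_NODE}M" ./myProg <myArgs> --numLocales=2
```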
