Messy process-assignments with Lagged Radiation #932

cponder · 2021-10-05T20:16:25Z

This is an issue I have with the v7.* code, using the lagged-radiation setup.
To run with GPUs, we invoke MPAS-A using a wrapper that sets the ACC_DEVICE_NUM variable, which assigns a GPU to each process. This is necessary to keep the Dynamics processes, i.e. MPAS_DYNAMICS_RANKS_PER_NODE, running on separate GPUs for maximal performance.
The problem is that the ranks assigned to the Dynamics processes don't follow a simple pattern. For example, for [4 Radiation]+[2 Dynamics] the local ranks are assigned as follows:

0     Radiation
1     Dynamics
2     Radiation
3     Dynamics
4     Radiation
5     Radiation

I assume this derives from using some simple bit-tests to determine which ranks would be assigned to each type.
This has caused me some headaches, and it will get worse when we try to explain it to customers so that we don't have to solve the problem for them with each configuration.
I'd suggest we use the following arithmetic (bash syntax):

POOL_SIZE=$(( (MPAS_DYNAMICS_RANKS_PER_NODE + MPAS_RADIATION_RANKS_PER_NODE) / MPAS_DYNAMICS_RANKS_PER_NODE ))
POOL_RANK=$(( SLURM_LOCALID % POOL_SIZE ))    # == 0 for Dynamics ranks
POOL_ID=$(( SLURM_LOCALID / POOL_SIZE ))      # Gives a pool number where all the Radiation ranks are coupled
                                              # to the same Dynamics rank.

I think this would be a lot easier to manage from our end and shouldn't complicate the MPAS-A code.
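To make the proposed layout concrete, here is a minimal sketch that lists the role and pool of every local rank for a given configuration. The function name `show_layout` and the output format are only illustrative; this is the arithmetic suggested above, not what MPAS-A currently does.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the role and pool of each local rank
# under the proposed modulo/divide scheme.
#   $1 = Dynamics ranks per node, $2 = Radiation ranks per node
show_layout() {
    local dyn=$1 rad=$2
    local total=$((dyn + rad))
    local pool_size=$((total / dyn))   # integer division
    local id
    for ((id = 0; id < total; id++)); do
        if ((id % pool_size == 0)); then
            echo "$id Dynamics pool=$((id / pool_size))"
        else
            echo "$id Radiation pool=$((id / pool_size))"
        fi
    done
}

show_layout 2 4
```

For [4 Radiation]+[2 Dynamics] this puts the Dynamics processes on local ranks 0 and 3, each followed by its two Radiation ranks. In a launch wrapper the same expressions would be evaluated with SLURM_LOCALID in place of the loop index; assuming one GPU per pool, ACC_DEVICE_NUM could then be set to the pool number.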

cponder · 2021-10-05T20:22:29Z

In the case where MPAS_RADIATION_RANKS_PER_NODE is not evenly divisible by MPAS_DYNAMICS_RANKS_PER_NODE, the above formula would end up with the last "pool" having fewer ranks than the others. This may be less load-balanced than, say, having several pools with only $((POOL_SIZE-1)) ranks.
Was this the motivation for the current policy?
I'm not sure I really care about that case, but there may still be some simpler arithmetic to manage it.
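One possible way to handle the uneven case, sketched here purely as an assumption: start pool p at local rank p*TOTAL/DYN (integer division), so pool sizes differ by at most one and the number of pools always equals the number of Dynamics ranks. With the plain integer-division formula, 2 Dynamics + 5 Radiation gives POOL_SIZE=3, so local rank 6 would start a third, single-rank pool; the variant below avoids that. The name `balanced_layout` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical variant: balanced pools when the rank counts don't divide evenly.
#   $1 = Dynamics ranks per node, $2 = Radiation ranks per node
balanced_layout() {
    local dyn=$1 rad=$2
    local total=$((dyn + rad))
    local id pool start
    for ((id = 0; id < total; id++)); do
        # Largest pool index whose first rank is <= id.
        pool=$(( ((id + 1) * dyn - 1) / total ))
        # First local rank of that pool; it hosts the Dynamics process.
        start=$(( pool * total / dyn ))
        if ((id == start)); then
            echo "$id Dynamics pool=$pool"
        else
            echo "$id Radiation pool=$pool"
        fi
    done
}

balanced_layout 2 5
```

When the total is an exact multiple of the Dynamics count, this reduces to the simpler modulo/divide arithmetic, so the even case keeps the same layout.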
