This is an issue I have with the v7.* code, using the lagged-radiation setup.
To run with GPUs, we invoke MPAS-A using a wrapper that sets the ACC_DEVICE_NUM variable, which assigns a GPU to each process. This is necessary to keep the Dynamics processes, i.e. MPAS_DYNAMICS_RANKS_PER_NODE, running on separate GPUs for maximal performance.
The problem is that the ranks assigned to the Dynamics processes don't follow a very simple pattern, for example for [4 Radiation]+[2 Dynamics]
I assume this derives from using some simple bit-tests to determine which ranks would be assigned to each type.
This has already caused me some headaches, and it will get worse once we have to explain it to customers without solving the problem for them for each configuration.
I'd suggest we use the following arithmetic (bash syntax):
# POOL_RANK and POOL_ID are illustrative variable names only.
POOL_SIZE=$(( (MPAS_DYNAMICS_RANKS_PER_NODE + MPAS_RADIATION_RANKS_PER_NODE) / MPAS_DYNAMICS_RANKS_PER_NODE ))
POOL_RANK=$(( SLURM_LOCALID % POOL_SIZE )) # == 0 for Dynamics ranks
POOL_ID=$(( SLURM_LOCALID / POOL_SIZE ))   # Gives a pool number where all the Radiation ranks are coupled
                                           # to the same Dynamics rank.
I think this would be a lot easier to manage from our end and shouldn't complicate the MPAS-A code.
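To make the proposal concrete, here is a minimal sketch of how a launch wrapper could use this arithmetic. It assumes the proposed layout (not the one MPAS-A currently uses), and the helper names `pool_size`, `is_dynamics`, and `pool_id` are made up for illustration. Mapping the pool number directly to `ACC_DEVICE_NUM` is also an assumption (one GPU per pool).

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and the rank layout is the
# one proposed in this issue, not the current MPAS-A behaviour.

# pool_size DYN RAD -> number of ranks per pool
# (one Dynamics rank plus the Radiation ranks coupled to it)
pool_size() {
    local dyn=$1 rad=$2
    echo $(( (dyn + rad) / dyn ))
}

# is_dynamics LOCALID POOL_SIZE -> 1 if the rank is a Dynamics rank, else 0
is_dynamics() {
    local localid=$1 size=$2
    if (( localid % size == 0 )); then echo 1; else echo 0; fi
}

# pool_id LOCALID POOL_SIZE -> index of the pool this rank belongs to
pool_id() {
    local localid=$1 size=$2
    echo $(( localid / size ))
}

# Possible use inside a wrapper (assumption: one GPU per pool):
#   size=$(pool_size "$MPAS_DYNAMICS_RANKS_PER_NODE" "$MPAS_RADIATION_RANKS_PER_NODE")
#   export ACC_DEVICE_NUM=$(pool_id "$SLURM_LOCALID" "$size")
#   exec "$@"
```

For [4 Radiation]+[2 Dynamics] this gives a pool size of 3, so local IDs 0 and 3 would be the Dynamics ranks and local IDs 0-2 and 3-5 would form the two pools.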
In the case where MPAS_RADIATION_RANKS_PER_NODE is not evenly divisible by MPAS_DYNAMICS_RANKS_PER_NODE, the above formula would end up with the last "pool" having fewer ranks than the others. This may be less load-balanced than, say, having several pools with only $((POOL_SIZE-1)) ranks.
Was this the motivation for the current policy?
I'm not sure I really care about that case, but there may still be some simpler arithmetic to manage it.
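If the uneven case does matter, one possible arithmetic is a standard block distribution, where pool sizes differ by at most one rank. The sketch below is only an illustration under that assumption; the function names are hypothetical and nothing here reflects what MPAS-A does today. One thing to watch: bash integer division rounds down, so with the formula as written and, for example, 2 Dynamics + 3 Radiation ranks, POOL_SIZE would be 2 and local ID 4 would also satisfy the `% POOL_SIZE == 0` test.

```bash
#!/usr/bin/env bash
# Sketch only: balanced pools when the Radiation count is not a multiple
# of the Dynamics count. All names here are hypothetical.
# DYN   = Dynamics ranks per node
# TOTAL = Dynamics + Radiation ranks per node

# balanced_pool_start POOL DYN TOTAL -> first local ID of the pool
balanced_pool_start() {
    local pool=$1 dyn=$2 total=$3
    echo $(( pool * total / dyn ))
}

# balanced_pool_id LOCALID DYN TOTAL -> pool that the local ID falls into
balanced_pool_id() {
    local localid=$1 dyn=$2 total=$3
    echo $(( (dyn * (localid + 1) - 1) / total ))
}

# balanced_is_dynamics LOCALID DYN TOTAL -> 1 if the rank is the first
# rank of its pool (treated as the Dynamics rank), else 0
balanced_is_dynamics() {
    local localid=$1 dyn=$2 total=$3 pool start
    pool=$(balanced_pool_id "$localid" "$dyn" "$total")
    start=$(balanced_pool_start "$pool" "$dyn" "$total")
    if (( localid == start )); then echo 1; else echo 0; fi
}
```

When the counts divide evenly, this reduces to the simpler modulo/divide formula from the issue description.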