I'm running on a cluster (Dawn@Cambridge) with 4-OAM PVC nodes and 4x mlx5 cards, which appear as mlx5_0 ... mlx5_3 in ibstat.
Intel MPI runs with performance that is only commensurate with using one of the mlx5 HDR 200 cards: 200 Gbit/s x (send + receive) = 50 GB/s bidirectional. I expect nearer 200 GB/s bidirectional out of the node when running 8 MPI tasks per node.
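Spelling out the arithmetic behind those figures:

```
per card:  200 Gbit/s ≈ 25 GB/s each way  →  25 GB/s x 2 (send + receive) = 50 GB/s bidirectional
per node:  4 cards x 50 GB/s = 200 GB/s bidirectional
```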
Setting I_MPI_DEBUG=5 makes it display the provider info. This works, but it is "slow". I understand that the fi_mlx provider uses UCX underneath.
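For reference, this is roughly how the job is launched and where the provider info comes from (the application name and exact flags below are placeholders, not the real job script):

```bash
export I_MPI_DEBUG=5          # prints the selected libfabric provider at startup
export FI_PROVIDER=mlx        # fi_mlx provider (UCX underneath)
mpirun -n 8 -ppn 8 ./my_app   # 8 MPI tasks per node
```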
To get multi-rail working (one rail per MPI rank), I tried running through a wrapper script that uses $SLURM_LOCALID to set $UCX_NET_DEVICES:
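A simplified sketch of the wrapper (the device list and the round-robin rank-to-HCA mapping reflect the intent, not the exact script):

```bash
#!/bin/bash
# Pin each local MPI rank to one HCA, round-robin over the four cards seen in ibstat.
devices=(mlx5_0 mlx5_1 mlx5_2 mlx5_3)
idx=$(( SLURM_LOCALID % ${#devices[@]} ))
export UCX_NET_DEVICES="${devices[$idx]}:1"   # e.g. local rank 2 -> mlx5_2:1
exec "$@"                                     # run the real application
```

Invoked as something like `mpirun -n 8 -ppn 8 ./ucx_wrapper.sh ./my_app`.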
But this results in:

select.c:627 UCX ERROR no active messages transport to <no debug data>: self/memory - Destination is unreachable
Any ideas about how to use multiple adapters under Intel MPI and the fi_mlx transport? Is this the right direction? Is there something I should do differently?