Currently, the default way we build OpenMPI, MPICH etc. works well for InfiniBand systems, but shows very poor performance on Omni-Path (we saw between 2x and 10x worse bandwidth and latency in benchmarks on our system).
It would be good to figure out a way to improve Omni-Path support in EasyBuild (perhaps through a configuration option?); at a minimum, we should improve documentation.
relevant PRs to date:
[#20501] PSM2 dependency added to recent libfabric easyconfigs
[#20585] PSM2 dependency made conditional on having x86_64
[#20794] previous changes effectively undone: the PSM2 dependency was commented back out because of its CUDA build dependency
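To illustrate the idea behind #20585, here is a minimal sketch of an architecture-conditional dependency list, written as plain Python so it runs outside EasyBuild. The function name `libfabric_dependencies` and the `x.y.z` version strings are placeholders invented for this example, not taken from the actual easyconfigs; in a real easyconfig the check would presumably rely on EasyBuild's own architecture constant rather than `platform.machine()`.

```python
import platform


def libfabric_dependencies(arch=None):
    """Return an illustrative dependency list, adding PSM2 only on x86_64.

    Names and versions are placeholders (assumptions), not real easyconfig values.
    """
    # Fall back to the architecture of the machine running this script.
    arch = arch or platform.machine()

    # Hypothetical baseline dependency, always present.
    deps = [('numactl', 'x.y.z')]

    # Only add PSM2 on x86_64, mirroring the condition described in #20585.
    if arch == 'x86_64':
        deps.append(('PSM2', 'x.y.z'))

    return deps


if __name__ == '__main__':
    for arch in ('x86_64', 'aarch64'):
        print(arch, libfabric_dependencies(arch))
```

Note that this only covers the architecture condition; it does not address the CUDA build dependency that led to #20794.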
further info/ideas:
Omni-Path systems should use either PSM2 or opx
PSM2 can be either stand-alone, or via libfabric
opx is a libfabric provider; drop-in replacement for PSM2
Cornelis' plan is to move away from PSM2 (the upcoming 400G adapters will only support opx)
no benefit (only additional overhead) from using UCX with Omni-Path
Cornelis' documentation currently recommends using PSM2:
For best performance, Cornelis recommends that you use the PSM2, the high performance
interface to the OPX Fabric. This is accomplished using the Open Fabrics Interface (OFI) MPI
fabric setting -genv I_MPI_FABRICS=ofi and ensure that FI_PROVIDER=psm2.
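The settings quoted above can be sketched as a small helper that prepares the environment for an MPI launch. This is only an illustration: the helper name `omnipath_env` is made up for this example, `I_MPI_FABRICS` applies to Intel MPI specifically, and whether `opx` is usable as a provider value on a given system is an assumption based on the notes above.

```python
import os


def omnipath_env(provider="psm2", base_env=None):
    """Return a copy of the environment with the Omni-Path settings applied.

    `provider` is restricted to the two options discussed in this issue.
    """
    if provider not in ("psm2", "opx"):
        raise ValueError(f"unexpected provider for Omni-Path: {provider!r}")

    # Copy so the caller's environment (or os.environ) is left untouched.
    env = dict(os.environ if base_env is None else base_env)

    # Intel MPI: select the OFI (libfabric) fabric, as in the quoted docs.
    env["I_MPI_FABRICS"] = "ofi"
    # libfabric: select the provider to use.
    env["FI_PROVIDER"] = provider
    return env


if __name__ == "__main__":
    env = omnipath_env("psm2", base_env={})
    print(env)
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when starting `mpirun`; the exact launch command depends on the MPI library in use.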
[...] PSM2 in order to drop the CUDA build dependency
[...] OpenMPI, by including some minimal CUDA prototypes (since PSM2 will dlopen('libcuda.so.1') at runtime)
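The runtime-loading pattern mentioned above (the library is opened with `dlopen` when needed, instead of being linked at build time) can be illustrated with Python's `ctypes`. This is a sketch of the general technique only, not PSM2's actual C code; the helper names `try_dlopen` and `cuda_driver_available` are made up for this example.

```python
import ctypes


def try_dlopen(soname="libcuda.so.1"):
    """Try to load a shared library at runtime; return None if it is missing."""
    try:
        # ctypes.CDLL uses the platform's dynamic loader (dlopen on Linux).
        return ctypes.CDLL(soname)
    except OSError:
        # Library not installed or not loadable: degrade gracefully.
        return None


def cuda_driver_available():
    """Report whether the CUDA driver library can be loaded on this host."""
    return try_dlopen("libcuda.so.1") is not None


if __name__ == "__main__":
    print("CUDA driver library loadable:", cuda_driver_available())
```

Because the library is only looked up when the code runs, nothing CUDA-related has to be present on the build host beyond the function prototypes the calling code needs.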