-
Notifications
You must be signed in to change notification settings - Fork 449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCP/TEST: Fix failures in max_lanes test when have two IB devices #10549
UCP/TEST: Fix failures in max_lanes test when have two IB devices #10549
Conversation
src/ucp/wireup/select.c
Outdated
(context->config.ext.max_rndv_lanes == 0) || | ||
(context->config.ext.rndv_mode == UCP_RNDV_MODE_AM) || | ||
(context->config.ext.rndv_mode == UCP_RNDV_MODE_RKEY_PTR)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- we also use rma_bw lanes for RMA
- in theory we can fail to use one of the mentioned rndv schemes and fallback to some other which is using rma_bw lanes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with the first point (so we need to check the feature flags),
regarding the second point - the fallback is always to AM scheme that does not use RMA lanes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relying on rndv_scheme still seems risky/error prone to me. Recently we changed the fallback flow for rndv protocols when requested scheme is not available, so it can happen again in the future and this code can implicitly affect that
src/ucp/wireup/select.c
Outdated
return 1; | ||
} | ||
|
||
/* RMA API is used and multi-lane RMA is enabled */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment is confusing, we may use rma_bw lane even with single lane (also you check that max_rma_lanes > 0, not 1)
bw_info.max_lanes = ucs_max(bw_info.max_lanes, | ||
context->config.ext.max_rndv_lanes - 1); | ||
} | ||
excluded_am_lane = UCP_NULL_LANE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also update construct_lanes?
https://github.com/openucx/ucx/blob/master/src/ucp/wireup/select.c#L2394
@brminich can you pls take a look? |
1. When the test is run with RNDV_MODE=am_bcopy/zcopy, we may select rma_bw lanes on one device, leaving no place for am_bw lanes (if the am_bw score dictates the other device should be used). Fix by not adding rma_bw lanes when RDNV_MODE is forced to active messages or rkey_ptr. 2. The test is run with tag offload enabled, sometimes tag offload lane can be different than rma_bw/am_bw lanes, so we get one less lane for large data. Fix by expecting to use almost all lanes instead of all of them.
49e0f65
to
a146329
Compare
Why
Fix failures on new01 machine:
How
When the test is run with RNDV_MODE=am_bcopy/zcopy, we may select rma_bw lanes on one device, leaving no place for am_bw lanes (if the am_bw score dictates the other device should be used). Fix by not adding rma_bw lanes when RDNV_MODE is forced to active messages or rkey_ptr.
The test is run with tag offload enabled, sometimes tag offload lane can be different than rma_bw/am_bw lanes, so we get one less lane for large data. Fix by expecting to use almost all lanes instead of all of them.