UCP/TEST: Fix failures in max_lanes test when have two IB devices #10549

Merged

@yosefe yosefe commented Mar 13, 2025

Why

Fix failures on new01 machine:

2025-03-06T17:40:21.7528501Z [ RUN      ] rc/multi_rail_max.max_lanes/5 <rc/rndv_am_zcopy>
2025-03-06T17:40:22.4448500Z [     INFO ] lane[0] : sender 10506263 receiver 76
2025-03-06T17:40:22.4453886Z [     INFO ] lane[1] : sender 0 receiver 0
2025-03-06T17:40:22.4459444Z /scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1258: Failure
2025-03-06T17:40:22.4461149Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
2025-03-06T17:40:22.4465411Z [     INFO ] lane[2] : sender 0 receiver 0
2025-03-06T17:40:22.4470367Z /scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1258: Failure
2025-03-06T17:40:22.4471911Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
...
2025-03-06T17:40:22.4731789Z /scrap/azure/agent-04/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/test_ucp_tag_xfer.cc:1258: Failure
2025-03-06T17:40:22.4732966Z Expected: (sender_tx + receiver_tx) > (0), actual: 0 vs 0
2025-03-06T17:40:23.0831853Z [  FAILED  ] rc/multi_rail_max.max_lanes/5, where GetParam() = rc/rndv_am_zcopy (1330 ms)

How

  1. When the test runs with RNDV_MODE=am_bcopy/zcopy, we may select rma_bw lanes on one device, leaving no room for am_bw lanes (if the am_bw score dictates that the other device should be used). Fix by not adding rma_bw lanes when RNDV_MODE is forced to active messages or rkey_ptr.

  2. The test runs with tag offload enabled, and the tag offload lane can sometimes differ from the rma_bw/am_bw lanes, so one fewer lane is available for large data. Fix by expecting almost all lanes to be used instead of all of them.
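
The difference between the strict per-lane check that fails in the log above and the relaxed expectation from item 2 can be sketched as follows. This is a minimal illustration, not the actual gtest code: `lane_traffic_t`, `count_used_lanes`, `all_lanes_used`, `almost_all_lanes_used`, and the `slack` parameter are hypothetical names, and how many idle lanes the real test tolerates is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-lane traffic counters, mirroring the "sender N receiver M" log lines.
 * Hypothetical type, not taken from the UCX test sources. */
typedef struct {
    uint64_t sender_tx;
    uint64_t receiver_tx;
} lane_traffic_t;

/* Count the lanes on which either side transmitted anything. */
static size_t count_used_lanes(const lane_traffic_t *lanes, size_t num_lanes)
{
    size_t used = 0;
    size_t i;

    for (i = 0; i < num_lanes; ++i) {
        if ((lanes[i].sender_tx + lanes[i].receiver_tx) > 0) {
            ++used;
        }
    }

    return used;
}

/* Strict expectation: every lane carried traffic. */
static int all_lanes_used(const lane_traffic_t *lanes, size_t num_lanes)
{
    return count_used_lanes(lanes, num_lanes) == num_lanes;
}

/* Relaxed expectation: up to `slack` lanes may stay idle.
 * Assumption: slack = 1 models one lane taken by tag offload. */
static int almost_all_lanes_used(const lane_traffic_t *lanes,
                                 size_t num_lanes, size_t slack)
{
    assert(slack <= num_lanes);
    return count_used_lanes(lanes, num_lanes) >= (num_lanes - slack);
}
```

With three lanes of which one is idle, `all_lanes_used` reports failure while `almost_all_lanes_used(lanes, 3, 1)` accepts the result.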

Comment on lines 1839 to 1841
(context->config.ext.max_rndv_lanes == 0) ||
(context->config.ext.rndv_mode == UCP_RNDV_MODE_AM) ||
(context->config.ext.rndv_mode == UCP_RNDV_MODE_RKEY_PTR)) {
Contributor

  • We also use rma_bw lanes for RMA.
  • In theory, we can fail to use one of the mentioned rndv schemes and fall back to another one that uses rma_bw lanes.

Contributor Author

Agree with the first point (so we need to check the feature flags).
Regarding the second point: the fallback is always to the AM scheme, which does not use RMA lanes.
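
The combined condition discussed in this thread (skip rma_bw lanes for forced AM or rkey_ptr rendezvous, but keep them when the RMA feature is in use) might look roughly like the sketch below. It is not the merged UCX code: `rndv_mode_t`, `lane_config_t`, `rma_feature_enabled`, and both function names are hypothetical stand-ins for the `context->config.ext` fields and feature flags mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the rendezvous modes discussed in the review. */
typedef enum {
    RNDV_MODE_AUTO,
    RNDV_MODE_GET_ZCOPY,
    RNDV_MODE_PUT_ZCOPY,
    RNDV_MODE_AM,
    RNDV_MODE_RKEY_PTR
} rndv_mode_t;

/* Hypothetical subset of the configuration that affects rma_bw selection. */
typedef struct {
    bool        rma_feature_enabled; /* assumption: RMA API requested */
    unsigned    max_rndv_lanes;
    rndv_mode_t rndv_mode;
} lane_config_t;

/* Rendezvous needs rma_bw lanes unless it is disabled or forced to a
 * scheme that does not use RMA lanes (active messages or rkey_ptr). */
static bool rndv_needs_rma_bw_lanes(const lane_config_t *config)
{
    return (config->max_rndv_lanes != 0) &&
           (config->rndv_mode != RNDV_MODE_AM) &&
           (config->rndv_mode != RNDV_MODE_RKEY_PTR);
}

/* rma_bw lanes are added if either the RMA API or rendezvous needs them. */
static bool should_add_rma_bw_lanes(const lane_config_t *config)
{
    assert(config != NULL);
    return config->rma_feature_enabled || rndv_needs_rma_bw_lanes(config);
}
```

Under these assumptions, a configuration with rendezvous forced to AM and no RMA feature gets no rma_bw lanes, which leaves room for am_bw lanes on the other device.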

Contributor

Relying on rndv_scheme still seems risky and error-prone to me. We recently changed the fallback flow for rndv protocols when the requested scheme is not available, so such a change can happen again in the future, and this code can implicitly affect that.

return 1;
}

/* RMA API is used and multi-lane RMA is enabled */
Contributor

The comment is confusing: we may use an rma_bw lane even with a single lane (also, you check that max_rma_lanes > 0, not 1).

bw_info.max_lanes = ucs_max(bw_info.max_lanes,
context->config.ext.max_rndv_lanes - 1);
}
excluded_am_lane = UCP_NULL_LANE;

yosefe commented Mar 20, 2025

@brminich can you pls take a look?

@yosefe yosefe force-pushed the topic/ucp-test-fix-failures-in-max-lanes branch from 49e0f65 to a146329 Compare March 20, 2025 09:00
@yosefe yosefe enabled auto-merge March 21, 2025 08:10
@yosefe yosefe merged commit fc3350e into openucx:master Mar 21, 2025
151 checks passed
@yosefe yosefe deleted the topic/ucp-test-fix-failures-in-max-lanes branch March 21, 2025 12:49