UCP/PROTO: Minimal version of protocol lane selection #10539

Open
wants to merge 7 commits into master from ucp/proto/lane-selection-mini
Conversation

@iyastreb (Contributor) commented Mar 7, 2025

What?

This is the minimal version of #10508

Formerly known as the "lane sorting" task.
When the CUDA context is set, the rma_bw_lanes array is adjusted based on GPU distance.
When the CUDA context is not set in the caller thread, the UCX protocol does not always choose the fastest lanes for GPU memory.

The idea is to select lanes at the protocol selection stage after performance estimation.
We need to find the best combination of lanes for a given operation.

Testing

https://confluence.nvidia.com/display/NSWX/Protocol+lane+selection+testing

Mock tests PR: #10547

@iyastreb force-pushed the ucp/proto/lane-selection-mini branch from 1987091 to a481bee on March 10, 2025 at 07:38
brminich previously approved these changes Mar 10, 2025
perf_attr->path_ratio = 0.9;
} else {
/* Others: first path consumes 99% of the full bandwidth */
perf_attr->path_ratio = (iface_attr.dev_num_paths > 1)? 0.99 : 1.0;
Contributor:
Why is it different from non-LAG RoCE?

Contributor Author:
To support the use case of CX6 with multiple paths.
The difference will be important at the performance aggregation step (not implemented in this PR). The idea of perf aggregation is: after lanes are selected, we calculate the efficient usage of the full interface bandwidth.
Efficient BW = [sum_of_selected_ratios] * path_ratio * full_bw
This formula should work for all devices: CX6, CX7, RoCE.

For a RoCE device the efficient usage = num_paths * path_ratio * full_bw,
because it's composed of equal parts of the same BW.
E.g. if we selected 2/10 RoCE paths, then the efficient BW = 20% of the full iface BW.

For CX6 it's different: if we set IB_NUM_PATHS=10 and select 2 of them, then the
efficient BW = [0.99 + 0.0..] = 99+% of the full iface BW.
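
To illustrate the aggregation idea (this step is not part of this PR), here is a minimal C sketch of one reading of the formula above, where each selected path contributes its own ratio of the full interface bandwidth; the function and parameter names are hypothetical:

/* Sum the ratios of the selected paths and scale the full interface BW.
 * For RoCE every path ratio is 1/num_paths (so 2 of 10 paths -> 20%);
 * for CX6 the first path alone contributes ~0.99. */
static double efficient_bandwidth(const double *path_ratio,
                                  unsigned num_selected, double full_bw)
{
    double sum = 0.0;
    unsigned i;

    for (i = 0; i < num_selected; ++i) {
        sum += path_ratio[i];
    }
    return sum * full_bw;
}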

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
if (uct_ib_iface_is_roce(ib_iface)) {
/* ROCE: Equal share per each path */
perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
Contributor:
I'd use this formula for all cases except CX-7 non-LAG.

Contributor:
No need for the (double) cast.

}

/* Select all available indexes */
index_map = UCS_BIT(num_lanes) - 1;
Contributor:
Suggested change:
-    index_map = UCS_BIT(num_lanes) - 1;
+    index_map = UCS_MASK(num_lanes);
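
For reference, UCS_MASK is, as far as I recall, defined in ucs/sys/math.h as below, so the suggestion is purely cosmetic:

#define UCS_MASK(i)  (UCS_BIT(i) - 1)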

ucp_rsc_index_t rsc_index)
{
ucp_worker_iface_t *wiface = ucp_worker_iface(params->worker, rsc_index);
unsigned dev_num_paths = wiface->attr.dev_num_paths;
Contributor:
We need to use the number of selected paths, not the number of supported paths.

Contributor Author:
This is a valid option as well; it was initially implemented like that, since it's simpler.
The differences between the two options are:

  • If we know only the selected paths count and don't know the overall paths count, then we should always keep a ratio gap for potentially remaining paths. E.g. for 2 paths it gives us these ratios:
    path0=0.9, path1=0.067
    so we keep 0.033 for potential next paths (we don't know how many we have).
  • If we rely on dev_num_paths, then we can split the ratios between paths so that there is no gap for remaining paths (see the sketch below). For example:
    2 paths: path0=0.9, path1=0.1
    3 paths: path0=0.9, path1=0.067, path2=0.033
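
A minimal sketch of the second option, assuming a halving split of the remaining 10% of bandwidth across the non-first paths. This is my own guess at a scheme that reproduces the numbers quoted above (0.9/0.1, 0.9/0.067/0.033, and 0.9/0.057/0.029/0.014 for 4 paths), not necessarily what the PR implements; the helper name is hypothetical and num_paths is assumed small and >= 1:

/* Split per-path ratios across dev_num_paths paths: the first path gets 0.9
 * and the remaining 0.1 is divided so that each subsequent path gets half of
 * the previous one (weights 2^(n-2), ..., 2, 1). */
static void split_path_ratios(unsigned num_paths, double *ratio)
{
    double rest, weights;
    unsigned i;

    if (num_paths == 1) {
        ratio[0] = 1.0;
        return;
    }

    ratio[0] = 0.9;
    rest     = 1.0 - 0.9;
    weights  = (double)((1u << (num_paths - 1)) - 1);
    for (i = 1; i < num_paths; ++i) {
        ratio[i] = rest * (double)(1u << (num_paths - 1 - i)) / weights;
    }
}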

/**
* Single path ratio of the full bandwidth.
*/
double path_ratio;
Contributor:
IMO we should expose the single-path bandwidth directly, not as a ratio.

Contributor Author:
Sure, it can be done with an absolute BW value.

But let's take a look at use case 1: we specify non-default IB_NUM_PATHS=4 MAX_RNDV_LANES=4, and my expectation (and the current behaviour) is that all 4 paths are selected by the algorithm, because the user explicitly enforces it, meaning the BW of the extra paths is not zero. I even added a test checking that requirement. That means that on the UCP side we need to calculate a bandwidth ratio for each non-first path.
Currently for 4 paths these ratios are: [0.9, 0.057, 0.029, 0.014]
It would work with absolute values as well, but it requires double work on each side: first we calculate the absolute BW value on the UCT side (for shared and dedicated BW), then we calculate the ratio from it on the UCP side. So to me it seems more logical to just pass the ratio, so no double work is needed.

The second use case is debatable.
What if the iface BW is capped by GPU distance? Let's say we have CX7 with iface BW=25GBps, single-path BW=20GBps, but we are capped by a GPU distance of 10GBps. Then I'm not sure what the desired behaviour is:

  • We still select 2 lanes, even if a single path can handle the capped BW. This is the current behavior, and it is the use case for using ratios, because we can still split the capped BW proportionally between paths.
  • We select just 1 lane. Then we see that it is capable of handling 10GBps alone, and don't select more lanes. In this case the absolute single-path value is preferable.

I guess we still want to keep the existing behavior here (= select 2 lanes)?

Contributor Author:
OK, as agreed we go with the absolute BW.

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
if (uct_ib_iface_is_roce(ib_iface)) {
/* ROCE: Equal share per each path */
perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
Contributor:
No need for the (double) cast.

} else if (uct_ib_iface_port_attr(ib_iface)->active_speed ==
UCT_IB_SPEED_NDR) {
/* CX7: first path consumes 90% of the full bandwidth */
perf_attr->path_ratio = 0.9;
Contributor:
The upper limit should be an absolute number, not related to the port link speed.

Contributor Author:
According to my tests (ib_read_bw and osu_bw), the single-path BW is ~97% of the full iface BW.
Should we hardcode it to 26e9 bytes per second?

Contributor Author:
As agreed, I measured this value on CX7 on a 400Gbps setup and hardcoded it here (= MAX_SINGLE_PATH).
Then the resulting
single_path_bandwidth = ucs_min(MAX_SINGLE_PATH, 0.95 * iface.bandwidth);

For other IB devices the single-path BW equals the full iface BW, and we can handle that properly on the UCP layer.
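
Putting the two statements together, a rough sketch of the agreed computation could look like this (the function name and the is_cx7_ndr flag are hypothetical; MAX_SINGLE_PATH is the constant measured on the 400Gbps CX7 setup, and ucs_min comes from ucs/sys/math.h):

/* Single-path bandwidth: capped for CX7/NDR, equal to the full iface BW
 * for other IB devices. */
static double ib_single_path_bandwidth(double iface_bw, int is_cx7_ndr)
{
    if (is_cx7_ndr) {
        return ucs_min(MAX_SINGLE_PATH, 0.95 * iface_bw);
    }
    return iface_bw;
}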

@@ -548,6 +548,10 @@ uct_mm_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr)
perf_attr->bandwidth.dedicated = iface->super.config.bandwidth;
}

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
Contributor:
what if perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH is not requested?

Contributor Author:
fixed
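
For context, one possible shape of such a fix (purely illustrative, not necessarily the actual change; the path_bandwidth member name and its shared/dedicated layout are assumed) is to fill the path bandwidth from the interface configuration directly, instead of reading perf_attr->bandwidth, which is only valid when UCT_PERF_ATTR_FIELD_BANDWIDTH was requested:

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
    /* do not depend on perf_attr->bandwidth being filled in */
    perf_attr->path_bandwidth.shared    = 0;
    perf_attr->path_bandwidth.dedicated = iface->super.config.bandwidth;
}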

@@ -30,6 +30,20 @@
#include <poll.h>


/**
* Maximum bandwidth of CX7 single path.
Contributor:
with PCIe Gen5 and RDMA_READ op

Contributor Author:
ok

Comment on lines 40 to 42
* The minimal ratio is used to calculate the ratio for the first device path,
* when the full interface bandwidth is capped by PCI distance. In this case
* single path still does not consume the full interface bandwidth, but around
Contributor:
Can just mention PCIe Gen4.
Also, is it relevant for rdma_read only?

Contributor Author:
Done for both.
Yes, according to my measurements it's relevant only to RDMA_READ ops.

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
if (uct_ib_iface_is_roce(ib_iface) &&
(uct_ib_iface_roce_lag_level(ib_iface) > 1)) {
/* For RoCE devices split iface bandwidth equally between paths */
Contributor:
maybe remove the comment or mention LAG in it

Contributor Author:
removed

Comment on lines 502 to 503
if ((perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH) ||
(perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH)) {
Contributor:
Seems like this could be an inline function, because this check is used in many places now.

Contributor Author:
done
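
A sketch of what such a helper could look like (the name is hypothetical):

static inline int uct_perf_attr_bandwidth_requested(const uct_perf_attr_t *perf_attr)
{
    return !!(perf_attr->field_mask & (UCT_PERF_ATTR_FIELD_BANDWIDTH |
                                       UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH));
}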

Comment on lines 196 to 197
* For CX7 and other IB devices, the single path takes the most of the
* interface bandwidth, and the rest of the paths share the remaining
Contributor:
A bit confusing, because it is different for CX-7 and others (like CX5, CX6).

Contributor Author:
I removed this from the documentation; it's more of an implementation detail anyway.

Comment on lines +218 to +219
fixed_first_lane = params->first.lane_type != params->middle.lane_type;
for (i = fixed_first_lane ? 1 : 0, num_fast_lanes = i; i < num_lanes; ++i) {
Contributor:
Minor: maybe have just 1 branch instead? Like:
i = (params->first.lane_type != params->middle.lane_type) ? 1 : 0;
Also you could use just
i = (params->first.lane_type != params->middle.lane_type);
but it does not look clean.
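
Spelled out, the suggested single-branch variant would look roughly like this (a sketch of the reviewer's proposal, with the loop body unchanged):

i              = (params->first.lane_type != params->middle.lane_type) ? 1 : 0;
num_fast_lanes = i;
for (; i < num_lanes; ++i) {
    /* same loop body as before */
}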

{
ucp_rsc_index_t rsc_index = ucp_proto_common_get_rsc_index(params, lane);

ucs_assert(selection->length < UCP_PROTO_MAX_LANES);
Contributor:
assertv

Contributor Author:
done
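
For reference, the requested change is roughly the following (the format string is illustrative):

ucs_assertv(selection->length < UCP_PROTO_MAX_LANES, "length=%u",
            selection->length);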

@yosefe (Contributor) commented Mar 18, 2025

wire compat test failure seems relevant
