UCP/PROTO: Minimal version of protocol lane selection #10539

Open
wants to merge 7 commits into master from ucp/proto/lane-selection-mini
Conversation

@iyastreb (Contributor) commented Mar 7, 2025

What?

This is the minimal version of #10508

Formerly known as the "lane sorting" task.
When the CUDA context is set, the rma_bw_lanes array is adjusted based on GPU distance.
When the CUDA context is not set in the caller thread, the UCX protocol does not always choose the fastest lanes for GPU memory.

The idea is to select lanes at the protocol selection stage after performance estimation.
We need to find the best combination of lanes for a given operation.

Testing

https://confluence.nvidia.com/display/NSWX/Protocol+lane+selection+testing

Mock tests PR: #10547

@iyastreb force-pushed the ucp/proto/lane-selection-mini branch from 1987091 to a481bee on March 10, 2025 at 07:38
brminich previously approved these changes Mar 10, 2025
perf_attr->path_ratio = 0.9;
} else {
/* Others: first path consumes 99% of the full bandwidth */
perf_attr->path_ratio = (iface_attr.dev_num_paths > 1)? 0.99 : 1.0;
Contributor:
Why is it different from non-LAG RoCE?

Contributor Author:
To support the use case of CX6 with multiple paths.
The difference will be important at the performance aggregation step (not implemented in this PR). The idea of perf aggregation is: after lanes are selected, we calculate the efficient usage of the full interface bandwidth.
Efficient BW = [sum_of_selected_ratios] * path_ratio * full_bw
This formula should work for all devices: CX6, CX7, RoCE.

For a RoCE device the efficient usage = num_paths * path_ratio * full_bw,
because it's composed of equal parts of the same BW.
E.g. if we selected 2/10 RoCE paths, then the efficient BW = 20% of the full iface BW.

For CX6 it's different: if we set IB_NUM_PATHS=10 and select 2 of them, then the
efficient BW = [0.99 + 0.0..] = 99+% of the full iface BW.
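
To illustrate the aggregation idea (this step is not part of this PR), here is a minimal C sketch of one reading of the formula above, where each selected path contributes its own ratio of the full interface bandwidth; the function and parameter names are hypothetical:

/* Sum the ratios of the selected paths and scale the full interface BW.
 * For RoCE every path ratio is 1/num_paths (so 2 of 10 paths -> 20%);
 * for CX6 the first path alone contributes ~0.99. */
static double efficient_bandwidth(const double *path_ratio,
                                  unsigned num_selected, double full_bw)
{
    double sum = 0.0;
    unsigned i;

    for (i = 0; i < num_selected; ++i) {
        sum += path_ratio[i];
    }
    return sum * full_bw;
}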

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
if (uct_ib_iface_is_roce(ib_iface)) {
/* ROCE: Equal share per each path */
perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
Contributor:
I'd use this formula for all cases except CX-7 non-LAG.

Contributor:
No need for the (double) cast.

}

/* Select all available indexes */
index_map = UCS_BIT(num_lanes) - 1;
Contributor:
Suggested change:
-    index_map = UCS_BIT(num_lanes) - 1;
+    index_map = UCS_MASK(num_lanes);
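
For reference, UCS_MASK is, as far as I recall, defined in ucs/sys/math.h as below, so the suggestion is purely cosmetic:

#define UCS_MASK(i)  (UCS_BIT(i) - 1)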

ucp_rsc_index_t rsc_index)
{
ucp_worker_iface_t *wiface = ucp_worker_iface(params->worker, rsc_index);
unsigned dev_num_paths = wiface->attr.dev_num_paths;
Contributor:
We need to use the number of selected paths, not the number of supported paths.

Contributor Author:
This is a valid option as well; it was initially implemented like that, since it's simpler.
The differences between the two options are:

  • If we know only the selected paths count and don't know the overall paths count, then we should always keep a ratio gap for potentially remaining paths. E.g. for 2 paths it gives us these ratios:
    path0=0.9, path1=0.067
    so we keep 0.033 for potential next paths (we don't know how many we have).
  • If we rely on dev_num_paths, then we can split the ratios between paths so that there is no gap for remaining paths (see the sketch below). For example:
    2 paths: path0=0.9, path1=0.1
    3 paths: path0=0.9, path1=0.067, path2=0.033
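
A minimal sketch of the second option, assuming a halving split of the remaining 10% of bandwidth across the non-first paths. This is my own guess at a scheme that reproduces the numbers quoted above (0.9/0.1, 0.9/0.067/0.033, and 0.9/0.057/0.029/0.014 for 4 paths), not necessarily what the PR implements; the helper name is hypothetical and num_paths is assumed small and >= 1:

/* Split per-path ratios across dev_num_paths paths: the first path gets 0.9
 * and the remaining 0.1 is divided so that each subsequent path gets half of
 * the previous one (weights 2^(n-2), ..., 2, 1). */
static void split_path_ratios(unsigned num_paths, double *ratio)
{
    double rest, weights;
    unsigned i;

    if (num_paths == 1) {
        ratio[0] = 1.0;
        return;
    }

    ratio[0] = 0.9;
    rest     = 1.0 - 0.9;
    weights  = (double)((1u << (num_paths - 1)) - 1);
    for (i = 1; i < num_paths; ++i) {
        ratio[i] = rest * (double)(1u << (num_paths - 1 - i)) / weights;
    }
}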

/**
* Single path ratio of the full bandwidth.
*/
double path_ratio;
Contributor:
IMO we should expose the single-path bandwidth directly, not as a ratio.

Contributor Author:
Sure, it can be done with an absolute BW value.

But let's take a look at use case 1: we specify non-default IB_NUM_PATHS=4 MAX_RNDV_LANES=4, and my expectation (and the current behaviour) is that all 4 paths are selected by the algorithm, because the user explicitly enforces it, meaning the BW of the extra paths is not zero. I even added a test checking that requirement. That means that on the UCP side we need to calculate a bandwidth ratio for each non-first path.
Currently for 4 paths these ratios are: [0.9, 0.057, 0.029, 0.014]
It would work with absolute values as well, but it requires double work on each side: first we calculate the absolute BW value on the UCT side (for shared and dedicated BW), then we calculate the ratio from it on the UCP side. So to me it seems more logical to just pass the ratio, so no double work is needed.

The second use case is debatable.
What if the iface BW is capped by GPU distance? Let's say we have CX7 with iface BW=25GBps, single-path BW=20GBps, but we are capped by a GPU distance of 10GBps. Then I'm not sure what the desired behaviour is:

  • We still select 2 lanes, even if a single path can handle the capped BW. This is the current behavior, and it is the use case for using ratios, because we can still split the capped BW proportionally between paths.
  • We select just 1 lane. Then we see that it is capable of handling 10GBps alone, and don't select more lanes. In this case the absolute single-path value is preferable.

I guess we still want to keep the existing behavior here (= select 2 lanes)?

Contributor Author:
OK, as agreed we go with the absolute BW.

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
if (uct_ib_iface_is_roce(ib_iface)) {
/* ROCE: Equal share per each path */
perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
Contributor:
No need for the (double) cast.

} else if (uct_ib_iface_port_attr(ib_iface)->active_speed ==
UCT_IB_SPEED_NDR) {
/* CX7: first path consumes 90% of the full bandwidth */
perf_attr->path_ratio = 0.9;
Contributor:
The upper limit should be an absolute number, not related to the port link speed.

Contributor Author:
According to my tests (ib_read_bw and osu_bw), the single-path BW is ~97% of the full iface BW.
Should we hardcode it to 26e9 bytes per second?

Contributor Author:
As agreed, I measured this value on CX7 on a 400Gbps setup and hardcoded it here (= MAX_SINGLE_PATH).
Then the resulting
single_path_bandwidth = ucs_min(MAX_SINGLE_PATH, 0.95 * iface.bandwidth);

For other IB devices the single-path BW equals the full iface BW, and we can handle that properly on the UCP layer.
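
Putting the two statements together, a rough sketch of the agreed computation could look like this (the function name and the is_cx7_ndr flag are hypothetical; MAX_SINGLE_PATH is the constant measured on the 400Gbps CX7 setup, and ucs_min comes from ucs/sys/math.h):

/* Single-path bandwidth: capped for CX7/NDR, equal to the full iface BW
 * for other IB devices. */
static double ib_single_path_bandwidth(double iface_bw, int is_cx7_ndr)
{
    if (is_cx7_ndr) {
        return ucs_min(MAX_SINGLE_PATH, 0.95 * iface_bw);
    }
    return iface_bw;
}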

@@ -548,6 +548,10 @@ uct_mm_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr)
perf_attr->bandwidth.dedicated = iface->super.config.bandwidth;
}

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
Contributor:
what if perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH is not requested?

Contributor Author:
fixed
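
For context, one possible shape of such a fix (purely illustrative, not necessarily the actual change; the path_bandwidth member name and its shared/dedicated layout are assumed) is to fill the path bandwidth from the interface configuration directly, instead of reading perf_attr->bandwidth, which is only valid when UCT_PERF_ATTR_FIELD_BANDWIDTH was requested:

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
    /* do not depend on perf_attr->bandwidth being filled in */
    perf_attr->path_bandwidth.shared    = 0;
    perf_attr->path_bandwidth.dedicated = iface->super.config.bandwidth;
}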

@@ -30,6 +30,20 @@
#include <poll.h>


/**
* Maximum bandwidth of CX7 single path.
Contributor:
with PCIe Gen5 and RDMA_READ op

Contributor Author:
ok

Comment on lines 40 to 42
* The minimal ratio is used to calculate the ratio for the first device path,
* when the full interface bandwidth is capped by PCI distance. In this case
* single path still does not consume the full interface bandwidth, but around
Contributor:
Can just mention PCIe Gen4.
Also, is it relevant for rdma_read only?

Contributor Author:
Done for both.
Yes, according to my measurements it's relevant only to RDMA_READ ops.

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
if (uct_ib_iface_is_roce(ib_iface) &&
(uct_ib_iface_roce_lag_level(ib_iface) > 1)) {
/* For RoCE devices split iface bandwidth equally between paths */
Contributor:
maybe remove the comment or mention LAG in it

Contributor Author:
removed

Comment on lines 502 to 503
if ((perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH) ||
(perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH)) {
Contributor:
Seems like this could be an inline function, because this check is used in many places now.

Contributor Author:
done
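
A sketch of what such a helper could look like (the name is hypothetical):

static inline int uct_perf_attr_bandwidth_requested(const uct_perf_attr_t *perf_attr)
{
    return !!(perf_attr->field_mask & (UCT_PERF_ATTR_FIELD_BANDWIDTH |
                                       UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH));
}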

Comment on lines 196 to 197
* For CX7 and other IB devices, the single path takes the most of the
* interface bandwidth, and the rest of the paths share the remaining
Contributor:
A bit confusing, because it is different for CX-7 and others (like CX5, CX6).

Contributor Author:
I removed this from the documentation; it's more of an implementation detail anyway.

Comment on lines +218 to +219
fixed_first_lane = params->first.lane_type != params->middle.lane_type;
for (i = fixed_first_lane ? 1 : 0, num_fast_lanes = i; i < num_lanes; ++i) {
Contributor:
Minor: maybe have just 1 branch instead? Like:
i = (params->first.lane_type != params->middle.lane_type) ? 1 : 0;
Also you could use just
i = (params->first.lane_type != params->middle.lane_type);
but it does not look clean.
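
Spelled out, the suggested single-branch variant would look roughly like this (a sketch of the reviewer's proposal, with the loop body unchanged):

i              = (params->first.lane_type != params->middle.lane_type) ? 1 : 0;
num_fast_lanes = i;
for (; i < num_lanes; ++i) {
    /* same loop body as before */
}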

{
ucp_rsc_index_t rsc_index = ucp_proto_common_get_rsc_index(params, lane);

ucs_assert(selection->length < UCP_PROTO_MAX_LANES);
Contributor:
assertv

Contributor Author:
done
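
For reference, the requested change is roughly the following (the format string is illustrative):

ucs_assertv(selection->length < UCP_PROTO_MAX_LANES, "length=%u",
            selection->length);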

@yosefe (Contributor) commented Mar 18, 2025

wire compat test failure seems relevant
