UCP/PROTO: Minimal version of protocol lane selection #10539
Conversation
src/uct/ib/base/ib_iface.c
Outdated
    perf_attr->path_ratio = 0.9;
} else {
    /* Others: first path consumes 99% of the full bandwidth */
    perf_attr->path_ratio = (iface_attr.dev_num_paths > 1) ? 0.99 : 1.0;
why is it different from non-LAG roce?
To support the use case of CX6 with multiple paths.
The difference becomes important at the performance aggregation step (not implemented in this PR). The idea of perf aggregation is: after the lanes are selected, we calculate the efficient usage of the full interface bandwidth:
Efficient BW = [sum_of_selected_ratios] * full_bw
This formula should work for all devices: CX6, CX7, RoCE.
For a RoCE device the efficient usage = num_paths * path_ratio * full_bw, because it's composed of equal shares of the same BW. E.g. if we selected 2/10 RoCE paths, then efficient BW = 20% of the full iface BW.
For CX6 it's different: if we set IB_NUM_PATHS=10 and select 2 of them, then the
efficient BW = [0.99+0.0..] = 99+% of the full iface BW
src/uct/ib/base/ib_iface.c
Outdated
if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
    if (uct_ib_iface_is_roce(ib_iface)) {
        /* ROCE: Equal share per each path */
        perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
I'd use this formula for all cases except CX-7 non-LAG.
no need for (double)
src/ucp/proto/proto_multi.c
Outdated
}

/* Select all available indexes */
index_map = UCS_BIT(num_lanes) - 1;
Suggested change:
- index_map = UCS_BIT(num_lanes) - 1;
+ index_map = UCS_MASK(num_lanes);
src/ucp/proto/proto_multi.c
Outdated
                        ucp_rsc_index_t rsc_index)
{
    ucp_worker_iface_t *wiface = ucp_worker_iface(params->worker, rsc_index);
    unsigned dev_num_paths = wiface->attr.dev_num_paths;
We need to use the number of selected paths, not the number of supported paths.
This is a valid option as well; it was initially implemented like that, since it's simpler.
The difference between the two options:
- If we know only the selected path count and don't know the overall path count, then we should always keep a ratio gap for potentially remaining paths. E.g. for 2 paths this gives the ratios path0=0.9, path1=0.067, so we keep 0.033 for potential next paths (we don't know how many we have).
- If we rely on dev_num_paths, then we can split the ratios between the paths so that there is no gap left for remaining paths. For example:
  2 paths: path0=0.9, path1=0.1
  3 paths: path0=0.9, path1=0.067, path2=0.033
src/uct/api/v2/uct_v2.h
Outdated
/**
 * Single path ratio of the full bandwidth.
 */
double path_ratio;
IMO we should expose single path bandwidth directly and not as a ratio
For sure it can be done with an absolute BW value.
But let's look at use case 1: we specify non-default IB_NUM_PATHS=4 MAX_RNDV_LANES=4, and my expectation (and the current behaviour) is that all 4 paths are selected by the algorithm, because the user explicitly enforces it, meaning the BW of the extra paths is not zero. I even added a test checking that requirement. That means that on the UCP side we need to calculate a bandwidth ratio for each non-first path.
Currently for 4 paths these ratios are: [0.9, 0.057, 0.029, 0.014]
It would work with absolute values as well, but requires double work on each side: first we calculate the absolute BW value on the UCT side (for shared and dedicated BW), then we calculate the ratio from it on the UCP side. So to me it seems more logical to just pass the ratio, so no double work is needed.
The second use case is debatable.
What if the iface BW is capped by GPU distance? Let's say we have CX7 with iface BW=25GBps and single-path BW=20GBps, but we are capped by a GPU distance=10GBps. Then I'm not sure what the desired behaviour is:
- We still select 2 lanes, even if a single path can handle the capped BW. This is the current behavior, and this is the use case for ratios, because we can still split the capped BW proportionally between paths.
- We select just 1 lane: we see that it is capable of handling 10GBps alone, and don't select more lanes. In this case an absolute single-path value is preferable.
I guess we still want to keep the existing behavior here (= select 2 lanes)?
Ok, as agreed we go on with absolute BW.
src/uct/ib/base/ib_iface.c
Outdated
if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_RATIO) {
    if (uct_ib_iface_is_roce(ib_iface)) {
        /* ROCE: Equal share per each path */
        perf_attr->path_ratio = 1.0 / (double)iface_attr.dev_num_paths;
no need for (double)
src/uct/ib/base/ib_iface.c
Outdated
} else if (uct_ib_iface_port_attr(ib_iface)->active_speed ==
           UCT_IB_SPEED_NDR) {
    /* CX7: first path consumes 90% of the full bandwidth */
    perf_attr->path_ratio = 0.9;
The upper limit should be an absolute number, not related to the port link speed.
According to my tests (ib_read_bw and osu_bw), the single-path BW is ~97% of the full iface BW.
Should we hardcode it to 26e9 bytes?
As agreed, I measured this value on CX7 on a 400Gbps setup and hardcoded it here (= MAX_SINGLE_PATH).
Then the resulting:
single_path_bandwidth = ucs_min(MAX_SINGLE_PATH, 0.95 * iface.bandwidth);
For other IB devices single path BW = full BW, and we can handle it properly on the UCP layer.
@@ -548,6 +548,10 @@ uct_mm_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr)
    perf_attr->bandwidth.dedicated = iface->super.config.bandwidth;
}

if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
what if perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH is not requested?
fixed
src/uct/ib/base/ib_iface.c
Outdated
@@ -30,6 +30,20 @@
#include <poll.h>

/**
 * Maximum bandwidth of CX7 single path.
with PCIe Gen5 and RDMA_READ op
ok
src/uct/ib/base/ib_iface.c
Outdated
 * The minimal ratio is used to calculate the ratio for the first device path,
 * when the full interface bandwidth is capped by PCI distance. In this case
 * single path still does not consume the full interface bandwidth, but around
Can just mention PCIe Gen4.
Also, is it relevant for rdma_read only?
Done for both.
Yes, according to my measurements it's relevant only to RDMA_READ ops.
src/uct/ib/base/ib_iface.c
Outdated
if (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH) {
    if (uct_ib_iface_is_roce(ib_iface) &&
        (uct_ib_iface_roce_lag_level(ib_iface) > 1)) {
        /* For RoCE devices split iface bandwidth equally between paths */
maybe remove the comment or mention LAG in it
removed
if ((perf_attr->field_mask & UCT_PERF_ATTR_FIELD_BANDWIDTH) ||
    (perf_attr->field_mask & UCT_PERF_ATTR_FIELD_PATH_BANDWIDTH)) {
Seems this can be an inline function, because this check is now used in many places.
done
src/uct/api/v2/uct_v2.h
Outdated
 * For CX7 and other IB devices, the single path takes the most of the
 * interface bandwidth, and the rest of the paths share the remaining
A bit confusing, because it is different for CX-7 and others (like CX5, CX6).
I removed this from documentation, it's more impl detail anyway
fixed_first_lane = params->first.lane_type != params->middle.lane_type;
for (i = fixed_first_lane ? 1 : 0, num_fast_lanes = i; i < num_lanes; ++i) {
Minor: maybe have just one branch instead? Like:
i = (params->first.lane_type != params->middle.lane_type) ? 1 : 0
You can also use just
i = (params->first.lane_type != params->middle.lane_type)
but it does not look clean.
src/ucp/proto/proto_multi.c
Outdated
{
    ucp_rsc_index_t rsc_index = ucp_proto_common_get_rsc_index(params, lane);

    ucs_assert(selection->length < UCP_PROTO_MAX_LANES);
assertv
done
wire compat test failure seems relevant
What?
This is the minimal version of #10508, formerly known as the "lane sorting" task.
When a CUDA context is set, the rma_bw_lanes array is adjusted with GPU distance. When the CUDA context is not set in the caller thread, the UCX protocol does not always choose the fastest lanes for GPU memory.
The idea is to select lanes at the protocol selection stage, after performance estimation: we need to find the best combination of lanes for a given operation.
Testing
https://confluence.nvidia.com/display/NSWX/Protocol+lane+selection+testing
Mock tests PR: #10547