-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA: Update cuda_copy perf estimates for Grace #10155
Conversation
9f8dc24
to
f6964ee
Compare
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2h)}, | ||
{"d2d", "device to device bandwidth", | ||
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2d)}, | ||
{"other", "any other src-dest memory types bandwidth", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor
{"other", "any other src-dest memory types bandwidth", | |
{"other", "any other memory types combinations bandwidth", |
ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW}, | ||
/* TODO: 1. Add separate keys for shared and dedicated bandwidth | ||
2. Remove the "other" key (use pref_loc for managed memory) */ | ||
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs,other:10000MBs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that possible to remove other and define value for other memory types combinations through value without label?
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs,other:10000MBs", | |
{"BW", "10000MBs,h2d:8300MBs,d2h:11660MBs,d2d:320GBs", |
@@ -87,7 +92,12 @@ typedef struct uct_cuda_copy_iface_config { | |||
uct_iface_config_t super; | |||
unsigned max_poll; | |||
unsigned max_cuda_events; | |||
double bandwidth; | |||
struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SeyedMir why not use bw[UCS_MEMORY_TYPE_LAST][UCS_MEMORY_TYPE_LAST] and avoid explicit fields for each direction? This way we don't have to introduce other
field and we can populate the bandwidths by referring to the specific source-destination combination. For example, replace:
{"h2d", "host to device bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.h2d)},
...
with
{"h2d", "host to device bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.[UCS_MEMORY_TYPE_UNKNOWN][UCS_MEMORY_TYPE_CUDA])},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about doing it that way initially in fact. But, then I thought that bw matrix will have entries for types that are completely irrelevant to CUDA, and there will be much more entries than the four in this struct. Having said that, I'm not strongly against using the matrix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe define a common struct type to also use it in uct_cuda_copy_iface_t?
82eaa85
to
b8ef03b
Compare
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2h)}, | ||
{"d2d", "device to device bandwidth", | ||
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2d)}, | ||
{"other", "any other memory types combinations bandwidth", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for other
key - you can set default value without any key (see https://github.com/openucx/ucx/blob/master/src/uct/cuda/gdr_copy/gdr_copy_iface.c#L27)
"Effective memory bandwidth", | ||
ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW}, | ||
/* TODO: 1. Add separate keys for shared and dedicated bandwidth | ||
2. Remove the "other" key (use pref_loc for managed memory) */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SeyedMir do we still need "other"?
maybe add h2h explicitly?
or rename it to "default"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need "other" because the existing code has a branch for each of h2d, d2h, d2d, and "else". The bw value set for h2d and "else" are not the same. My understanding is that we don't want this PR to change the already existing bw settings for non-grace platforms.
I can change it from "other" to "default", but I personally find "default" a bit confusing because the default value of the keys are set in line 42.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying my old comment here
no need for other key - you can set default value without any key (see https://github.com/openucx/ucx/blob/master/src/uct/cuda/gdr_copy/gdr_copy_iface.c#L27)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brminich
as @SeyedMir already mentioned: we need a key to set the value, that is used to initialize bw performance attribute in "else" case (not in [h2d, d2h, d2d]).
You can set default values for the existing tags. E.g. UCX_CUDA_COPY_BW=12345
sets 12345
to all declared tags.
We have four values to configure in the existing code of performance attributes query. If we want to keep old behavior, and just extend it, then we need to add four tags to the existing configuration variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok with 4-th tag then. IMO either is ok (other
or default
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SeyedMir can you pls change "other" to "default"?
the Grace config can be probably specified as
UCX_CUDA_COPY_BW=800GBs,d2d:3TBs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't use default
because it's a keyword. What else do you suggest?
@@ -134,7 +144,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface, | |||
|
|||
iface_attr->latency = UCT_CUDA_COPY_IFACE_LATENCY; | |||
iface_attr->bandwidth.dedicated = 0; | |||
iface_attr->bandwidth.shared = iface->config.bandwidth; | |||
iface_attr->bandwidth.shared = iface->config.bw.other; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can use just constant here to preserve wire compatibility and avoid adding other
key?
@yosefe, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brminich I removed the other
key and added a constant instead.
08e9d57
to
ca7e266
Compare
@@ -407,16 +416,17 @@ uct_cuda_copy_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr) | |||
perf_attr->bandwidth.dedicated = 0; | |||
if ((src_mem_type == UCS_MEMORY_TYPE_HOST) && | |||
(dst_mem_type == UCS_MEMORY_TYPE_CUDA)) { | |||
perf_attr->bandwidth.shared = (zcopy ? 8300.0 : 7900.0) * UCS_MBYTE; | |||
perf_attr->bandwidth.shared = zcopy ? iface->config.bw.h2d : | |||
iface->config.bw.h2d * 0.95; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does 0.95 come from overhead of cuStreamSynchronize
for short protocols?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe smth like
stream_sync_factor = zcopy ? 1 : 0.95;
...
perf_attr->bandwidth.shared = iface->config.bw.h2d * stream_sync_factor;
/azp run UCX PR |
Azure Pipelines successfully started running 1 pipeline(s). |
"Effective memory bandwidth", | ||
ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW}, | ||
/* TODO: Add separate keys for shared and dedicated bandwidth */ | ||
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SeyedMir can we introduce default
field to ensure managed memory pipelines see c2c perf and not current default of 10 GB/s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2de18cf
to
62aef2f
Compare
ccb30da
to
935719d
Compare
@yosefe please let me know if you have more comments or I should squash. |
c254298
to
b3d018b
Compare
@SeyedMir pls squash |
b3d018b
to
adc9897
Compare
UCT/CUDA: Update cuda_copy perf estimates for Grace
What
Update cuda_copy perf estimates for Grace-Hopper
Why ?
The bandwidth and latency values will be different for PCIe versus C2C links that connect CPU and GPU.
How ?
Update the cuda_copy bw config and UCX_CUDA_COPY_BW.