Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/CUDA: Update cuda_copy perf estimates for Grace #10155

Merged
merged 1 commit into from
Mar 19, 2025

Conversation

SeyedMir
Copy link
Contributor

What

Update cuda_copy perf estimates for Grace-Hopper

Why ?

The bandwidth and latency values will be different for PCIe versus C2C links that connect CPU and GPU.

How ?

Update the cuda_copy bw config and UCX_CUDA_COPY_BW.

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch from 9f8dc24 to f6964ee Compare September 18, 2024 17:24
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2h)},
{"d2d", "device to device bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2d)},
{"other", "any other src-dest memory types bandwidth",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor

Suggested change
{"other", "any other src-dest memory types bandwidth",
{"other", "any other memory types combinations bandwidth",

ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW},
/* TODO: 1. Add separate keys for shared and dedicated bandwidth
2. Remove the "other" key (use pref_loc for managed memory) */
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs,other:10000MBs",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that possible to remove other and define value for other memory types combinations through value without label?

Suggested change
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs,other:10000MBs",
{"BW", "10000MBs,h2d:8300MBs,d2h:11660MBs,d2d:320GBs",

@@ -87,7 +92,12 @@ typedef struct uct_cuda_copy_iface_config {
uct_iface_config_t super;
unsigned max_poll;
unsigned max_cuda_events;
double bandwidth;
struct {
Copy link
Contributor

@Akshay-Venkatesh Akshay-Venkatesh Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeyedMir why not use bw[UCS_MEMORY_TYPE_LAST][UCS_MEMORY_TYPE_LAST] and avoid explicit fields for each direction? This way we don't have to introduce other field and we can populate the bandwidths by referring to the specific source-destination combination. For example, replace:

         {"h2d", "host to device bandwidth",
          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.h2d)},
          ...

with

         {"h2d", "host to device bandwidth",
          ucs_offsetof(uct_cuda_copy_iface_config_t, bw.[UCS_MEMORY_TYPE_UNKNOWN][UCS_MEMORY_TYPE_CUDA])},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing it that way initially in fact. But, then I thought that bw matrix will have entries for types that are completely irrelevant to CUDA, and there will be much more entries than the four in this struct. Having said that, I'm not strongly against using the matrix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe define a common struct type to also use it in uct_cuda_copy_iface_t?

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch from 82eaa85 to b8ef03b Compare October 4, 2024 14:48
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2h)},
{"d2d", "device to device bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bw.d2d)},
{"other", "any other memory types combinations bandwidth",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for other key - you can set default value without any key (see https://github.com/openucx/ucx/blob/master/src/uct/cuda/gdr_copy/gdr_copy_iface.c#L27)

"Effective memory bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW},
/* TODO: 1. Add separate keys for shared and dedicated bandwidth
2. Remove the "other" key (use pref_loc for managed memory) */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other can be removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeyedMir do we still need "other"?
maybe add h2h explicitly?
or rename it to "default"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need "other" because the existing code has a branch for each of h2d, d2h, d2d, and "else". The bw value set for h2d and "else" are not the same. My understanding is that we don't want this PR to change the already existing bw settings for non-grace platforms.
I can change it from "other" to "default", but I personally find "default" a bit confusing because the default value of the keys are set in line 42.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copying my old comment here

no need for other key - you can set default value without any key (see https://github.com/openucx/ucx/blob/master/src/uct/cuda/gdr_copy/gdr_copy_iface.c#L27)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich
as @SeyedMir already mentioned: we need a key to set the value, that is used to initialize bw performance attribute in "else" case (not in [h2d, d2h, d2d]).

You can set default values for the existing tags. E.g. UCX_CUDA_COPY_BW=12345 sets 12345 to all declared tags.

We have four values to configure in the existing code of performance attributes query. If we want to keep old behavior, and just extend it, then we need to add four tags to the existing configuration variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok with 4-th tag then. IMO either is ok (other or default)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeyedMir can you pls change "other" to "default"?
the Grace config can be probably specified as
UCX_CUDA_COPY_BW=800GBs,d2d:3TBs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use default because it's a keyword. What else do you suggest?

@@ -134,7 +144,7 @@ static ucs_status_t uct_cuda_copy_iface_query(uct_iface_h tl_iface,

iface_attr->latency = UCT_CUDA_COPY_IFACE_LATENCY;
iface_attr->bandwidth.dedicated = 0;
iface_attr->bandwidth.shared = iface->config.bandwidth;
iface_attr->bandwidth.shared = iface->config.bw.other;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can use just constant here to preserve wire compatibility and avoid adding other key?
@yosefe, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brminich I removed the other key and added a constant instead.

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch 2 times, most recently from 08e9d57 to ca7e266 Compare January 31, 2025 00:46
@@ -407,16 +416,17 @@ uct_cuda_copy_estimate_perf(uct_iface_h tl_iface, uct_perf_attr_t *perf_attr)
perf_attr->bandwidth.dedicated = 0;
if ((src_mem_type == UCS_MEMORY_TYPE_HOST) &&
(dst_mem_type == UCS_MEMORY_TYPE_CUDA)) {
perf_attr->bandwidth.shared = (zcopy ? 8300.0 : 7900.0) * UCS_MBYTE;
perf_attr->bandwidth.shared = zcopy ? iface->config.bw.h2d :
iface->config.bw.h2d * 0.95;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does 0.95 come from overhead of cuStreamSynchronize for short protocols?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe smth like
stream_sync_factor = zcopy ? 1 : 0.95;
...
perf_attr->bandwidth.shared = iface->config.bw.h2d * stream_sync_factor;

@yosefe
Copy link
Contributor

yosefe commented Feb 24, 2025

/azp run UCX PR

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

brminich
brminich previously approved these changes Feb 24, 2025
"Effective memory bandwidth",
ucs_offsetof(uct_cuda_copy_iface_config_t, bandwidth), UCS_CONFIG_TYPE_BW},
/* TODO: Add separate keys for shared and dedicated bandwidth */
{"BW", "h2d:8300MBs,d2h:11660MBs,d2d:320GBs",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SeyedMir can we introduce default field to ensure managed memory pipelines see c2c perf and not current default of 10 GB/s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the other key back to get c2c perf for managed memory. Also updated config values.
@brminich and @yosefe please let me know what you think.

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch from 2de18cf to 62aef2f Compare February 28, 2025 16:07
rakhmets
rakhmets previously approved these changes Mar 1, 2025
@SeyedMir SeyedMir changed the title UCT/CUDA: Update cuda_copy perf estimates for Grace-Hopper UCT/CUDA: Update cuda_copy perf estimates for Grace Mar 3, 2025
@SeyedMir
Copy link
Contributor Author

SeyedMir commented Mar 3, 2025

@yosefe @brminich some tests failed and they don't seem relevant to the changes in this PR AFAICT.

@SeyedMir
Copy link
Contributor Author

@yosefe please let me know if you have more comments or I should squash.

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch from c254298 to b3d018b Compare March 14, 2025 15:11
@yosefe
Copy link
Contributor

yosefe commented Mar 16, 2025

@SeyedMir pls squash

@SeyedMir SeyedMir force-pushed the cuda-estimate-perf-conf branch from b3d018b to adc9897 Compare March 17, 2025 15:46
@yosefe yosefe enabled auto-merge March 18, 2025 06:13
@yosefe yosefe merged commit e27378e into openucx:master Mar 19, 2025
151 checks passed
SeyedMir pushed a commit to SeyedMir/ucx that referenced this pull request Mar 20, 2025
UCT/CUDA: Update cuda_copy perf estimates for Grace
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants