api: Update internal send/recv function signatures #820

Open. yexiang-aws wants to merge 1 commit into master from the v9_api branch.

yexiang-aws

Description of changes:

We've already adopted the v9 API interfaces. This change updates the internal send/recv function definitions to match the latest interfaces.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@yexiang-aws yexiang-aws force-pushed the v9_api branch 2 times, most recently from 17a7d12 to a98169e Compare March 21, 2025 16:16
@yexiang-aws yexiang-aws marked this pull request as ready for review March 21, 2025 16:17
@yexiang-aws yexiang-aws requested review from bwbarrett and a team as code owners March 21, 2025 16:17
@yexiang-aws yexiang-aws force-pushed the v9_api branch 2 times, most recently from e03051a to 484bf7d Compare March 21, 2025 16:21
@yexiang-aws yexiang-aws requested a review from a-szegel as a code owner March 21, 2025 16:21
@yexiang-aws yexiang-aws marked this pull request as draft March 21, 2025 16:22
@yexiang-aws yexiang-aws removed the request for review from a-szegel March 21, 2025 16:23
We've already adopted the v9 API interfaces. This change updates the
internal functions to match the latest interfaces.

Signed-off-by: Ye Xiang <[email protected]>
@yexiang-aws yexiang-aws marked this pull request as ready for review March 21, 2025 16:25
@yexiang-aws yexiang-aws requested a review from rajachan March 21, 2025 16:26
@yexiang-aws yexiang-aws changed the title api: Update internal send/recv function definitions api: Update internal send/recv function signatures Mar 21, 2025
Comment on lines +736 to 740
ncclResult_t nccl_net_ofi_isend_v9(void* sendComm, void* data, size_t size,
int tag, void* mhandle, void** request)
{
return nccl_net_ofi_isend(sendComm, data, size, tag, mhandle, request);
}

Nit: The isend_v9 interface looks the same as the isend interface. Do we really need both?


We shouldn't need one of the two. We run into confusion around the APIs every time we need to add or change an interface function.

What we have been doing (mostly; we've screwed this up in the past) is giving the un-versioned names to the latest API version we support. When a function's interface changes, we add a function suffixed with _v[N-1] and update all the previous code blocks. To me, this is not intuitive, and I don't love that updating all the old interfaces is an error-prone process.

I think we should change how we operate. Every API function should carry a version suffix naming the API version in which the function was first used. When a function's prototype or behavior changes, we add a new version of the function, suffixed with the API version in which the change occurred. Then we only have to copy the v[N-1] block to a vN block, change the few functions that changed in the new version, and not touch the past at all. I think this is more intuitive, and it also means we don't change the past, which gives me a bit of comfort.

What I think we should do with this patch is not touch the unversioned interface functions. The patch should just change the core interface from int to size_t, remove the overflow check for the size_t -> int cast, and add handling in the old functions so they pass a size_t array instead of an int array into the internal recv function.

In another patch, we should rename all the old functions to follow the "first time added" behavior, so we can get rid of some of this version madness.

@rajachan thoughts?
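
To make the proposal above concrete, here is a minimal sketch of a size_t-based core receive function with a legacy shim that converts an int array before calling it. Every name here (example_core_recv, example_recv_v8, example_recv_v9, EXAMPLE_MAX_RECVS) is hypothetical and chosen only for illustration; none of it is the plugin's actual code, and the prototypes are simplified. The negative-size rejection in the shim is also an assumption about where such validation could live, not something the proposal requires.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_RECVS 8 /* hypothetical cap on grouped receives */

/* Hypothetical core receive: sizes are size_t, matching the newest interface.
 * Reports the total number of bytes requested through *total_out. */
static int example_core_recv(int n, void **buffers, const size_t *sizes,
                             size_t *total_out)
{
    assert(buffers != NULL && sizes != NULL && total_out != NULL);

    size_t total = 0;
    for (int i = 0; i < n; i++) {
        total += sizes[i];
    }
    *total_out = total;
    return 0;
}

/* Hypothetical shim for an older API version whose prototype still uses int
 * sizes. It widens each entry into a size_t array and forwards to the core,
 * so the core never sees an int. */
static int example_recv_v8(int n, void **buffers, const int *sizes,
                           size_t *total_out)
{
    size_t converted[EXAMPLE_MAX_RECVS];

    if (n < 0 || n > EXAMPLE_MAX_RECVS) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        if (sizes[i] < 0) {
            return -1; /* assumption: reject at the legacy boundary only */
        }
        converted[i] = (size_t)sizes[i];
    }
    return example_core_recv(n, buffers, converted, total_out);
}

/* Hypothetical entry point for the version in which the prototype changed.
 * It already matches the core, so it forwards without conversion. */
static int example_recv_v9(int n, void **buffers, const size_t *sizes,
                           size_t *total_out)
{
    return example_core_recv(n, buffers, sizes, total_out);
}
```

With this layout, adding a later API version would mean adding a new suffixed function only for prototypes that changed, while example_recv_v8 stays untouched.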

@@ -5773,7 +5773,7 @@ static inline int check_post_rx_buff_req(nccl_net_ofi_rdma_req_t *rx_buff_req)
* @brief Send a message. This "interface function" is called, indirectly, from
* the application
*/
-static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int tag,
+static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, size_t size, int tag,

Just glancing at uses of size, it looks like we pass it into tracing and so into LTTNG (for send, NCCL_OFI_TRACE_SEND), so we should make sure this patch works with LTTNG enabled (--with-lttng). I'm guessing we'll get some complaint about trying to fit a size_t into an int, since we haven't updated LTTNG yet.


This patch should update LTTNG as well, in that case.
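
The narrowing concern raised above can be illustrated without any tracing library. The sketch below uses a hypothetical record type whose size field is still an int, standing in for a trace field that has not yet been widened; it does not use or reproduce the LTTng API or the plugin's tracepoint definitions. It shows only why a size_t value cannot be stored in such a field without either a range check or a loss of information, which is why widening the trace field itself is the cleaner fix.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* The comparison below assumes size_t is at least as wide as int. */
static_assert(sizeof(size_t) >= sizeof(int), "size_t narrower than int");

/* Hypothetical helper: true when `size` can be stored in an int without loss. */
static bool example_fits_in_int(size_t size)
{
    return size <= (size_t)INT_MAX;
}

/* Hypothetical stand-in for a trace record whose size field is still int. */
struct example_trace_record {
    int size;
};

/* Stores `size` in the int field, clamping to INT_MAX rather than converting
 * an out-of-range value. A record built this way under-reports large sizes. */
static struct example_trace_record example_make_record(size_t size)
{
    struct example_trace_record rec;
    rec.size = example_fits_in_int(size) ? (int)size : INT_MAX;
    return rec;
}
```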

}
for (size_t i = 0; i < n; i++) {
if (OFI_UNLIKELY(sizes[i] < 0)) {
NCCL_OFI_WARN("Message size %d can't be negative at index %zu", sizes[i], i);

Don't bozo check in the critical path. You shouldn't need this check function any more.

3 participants