-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
api: Update internal send/recv function signatures #820
base: master
Are you sure you want to change the base?
Conversation
17a7d12
to
a98169e
Compare
e03051a
to
484bf7d
Compare
We've already adopted interfaces to v9 API. This change updates the internal functions to match latest interfaces. Signed-off-by: Ye Xiang <[email protected]>
ncclResult_t nccl_net_ofi_isend_v9(void* sendComm, void* data, size_t size, | ||
int tag, void* mhandle, void** request) | ||
{ | ||
return nccl_net_ofi_isend(sendComm, data, size, tag, mhandle, request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: The isend_v9 interface looks the same as the isend interface. Do we really need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the two we shouldn't need. We have confusion around APIs every time we need to add / change an interface function.
What we have been doing (mostly, we've screwed this up in the past) is that the latest version API we support gets the un-versioned names. When a function interface changes, we add a function suffixed with _v[N-1]
and update all the previous code blocks. To me, this is not intuitive, and I don't love that we have to update all the old interfaces in an error-prone process.
I think we should change our operations. Every API function should have a version suffix of the API version in which the function was first used. When a function's prototype or behavior changes, we add a new version of the api with a version suffix of the API version in which the change occurred. Then we only have to copy the v[N-1] block to a vN block, change the few functions in the new version that changed, and not touch the past at all. I think this is more intuitive, but also means we don't change the past, which gives me a bit of comfort.
What I think we should do with this patch is not touch the unversioned interface functions. Just have this be changing the core interface from int to size_t, removing the overflow check for the size_t -> int cast, and adding the handling to old functions to pass a size_t array instead of an int array into the internal recv function.
In another patch, we should rename all the old functions to follow the "first time added" behavior, so we can get rid of some of this version madness.
@rajachan thoughts?
@@ -5773,7 +5773,7 @@ static inline int check_post_rx_buff_req(nccl_net_ofi_rdma_req_t *rx_buff_req) | |||
* @brief Send a message. This "interface function" is called, indirectly, from | |||
* the application | |||
*/ | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int tag, | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, size_t size, int tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just glancing at uses of size, it looks like we pass it into tracing and so into LTTNG (for send, NCCL_OFI_TRACE_SEND
), so we should make sure this patch works with LTTNG enabled (--with-lttng
). I'm guessing we'll get some complaint about trying to fit a size_t into an int, since we haven't updated LTTNG yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch should update LTTNG as well, in that case.
} | ||
for (size_t i = 0; i < n; i++) { | ||
if (OFI_UNLIKELY(sizes[i] < 0)) { | ||
NCCL_OFI_WARN("Message size %d can't be negative at index %zu", sizes[i], i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't bozo check in the critical path. You shouldn't need this check function any more.
ncclResult_t nccl_net_ofi_isend_v9(void* sendComm, void* data, size_t size, | ||
int tag, void* mhandle, void** request) | ||
{ | ||
return nccl_net_ofi_isend(sendComm, data, size, tag, mhandle, request); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the two we shouldn't need. We have confusion around APIs every time we need to add / change an interface function.
What we have been doing (mostly, we've screwed this up in the past) is that the latest version API we support gets the un-versioned names. When a function interface changes, we add a function suffixed with _v[N-1]
and update all the previous code blocks. To me, this is not intuitive, and I don't love that we have to update all the old interfaces in an error-prone process.
I think we should change our operations. Every API function should have a version suffix of the API version in which the function was first used. When a function's prototype or behavior changes, we add a new version of the api with a version suffix of the API version in which the change occurred. Then we only have to copy the v[N-1] block to a vN block, change the few functions in the new version that changed, and not touch the past at all. I think this is more intuitive, but also means we don't change the past, which gives me a bit of comfort.
What I think we should do with this patch is not touch the unversioned interface functions. Just have this be changing the core interface from int to size_t, removing the overflow check for the size_t -> int cast, and adding the handling to old functions to pass a size_t array instead of an int array into the internal recv function.
In another patch, we should rename all the old functions to follow the "first time added" behavior, so we can get rid of some of this version madness.
@rajachan thoughts?
@@ -5773,7 +5773,7 @@ static inline int check_post_rx_buff_req(nccl_net_ofi_rdma_req_t *rx_buff_req) | |||
* @brief Send a message. This "interface function" is called, indirectly, from | |||
* the application | |||
*/ | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, int size, int tag, | |||
static int send(nccl_net_ofi_send_comm_t *send_comm, void *data, size_t size, int tag, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch should update LTTNG as well, in that case.
Description of changes:
We've already adopted interfaces to v9 API. This change updates the internal send/recv function definitions to match latest interfaces.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.