UCT/CUDA_IPC: Context switching aware resource management #10538
base: master
Conversation
unsigned long stream_refcount[UCT_CUDA_IPC_MAX_PEERS];
/* per stream outstanding ops */
ucs_queue_head_t active_queue;
khash_t(cuda_ipc_queue_desc) queue_desc_map;
do we need a separate cudaEvents mpool per hash element?
ucs_queue_for_each_safe(q_desc, iter, &iface->active_queue, queue) {
    event_q = &q_desc->event_queue;
    count  += iface->ops.progress_queue(tl_iface, event_q, max_events - count);
uct_cuda_copy_progress_event_queue and uct_cuda_ipc_progress_event_queue have a lot in common: both iterate over the active queue and call cuEventQuery to check for completion. IMO the only difference is that cuda_ipc calls uct_cuda_ipc_unmap_memhandle on completion.
IMO it would be better to pull that common code up into uct_cuda_base_iface_progress; that would also improve performance by avoiding an extra function call.
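The suggested refactor could be sketched roughly as below: a single base-layer polling loop with an optional per-transport completion hook (cuda_copy passes NULL, cuda_ipc passes its unmap routine). This is a minimal standalone illustration; the type and function names here are hypothetical stand-ins, not the actual UCX symbols.

```c
#include <stddef.h>

/* simplified stand-in for an event descriptor on the active queue */
typedef struct cuda_event {
    struct cuda_event *next;
    int                complete;    /* stands in for cuEventQuery() result */
    void              *mapped_addr; /* used only by the ipc hook */
} cuda_event_t;

typedef struct {
    cuda_event_t *head;
} event_queue_t;

/* per-transport completion hook; NULL for cuda_copy,
 * something like uct_cuda_ipc_unmap_memhandle for cuda_ipc */
typedef void (*completion_cb_t)(cuda_event_t *event);

/* common loop pulled up to the base layer: poll completed events in
 * order, dequeue them, and invoke the transport hook if present */
static unsigned
base_progress_event_queue(event_queue_t *q, completion_cb_t cb, unsigned max)
{
    unsigned count = 0;

    while ((q->head != NULL) && (count < max)) {
        cuda_event_t *event = q->head;
        if (!event->complete) {
            break;                 /* events complete in submission order */
        }
        q->head = event->next;     /* dequeue completed event */
        if (cb != NULL) {
            cb(event);             /* transport-specific cleanup */
        }
        ++count;
    }
    return count;
}
```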
if (!ucs_queue_is_empty(&q_desc->event_queue)) {
    UCT_TL_IFACE_STAT_FLUSH_WAIT(ucs_derived_of(tl_iface,
                                                uct_base_iface_t));
Probably no need to check that q_desc->event_queue is non-empty; it is enough to check that iface->active_queue is non-empty.
{
    uct_cuda_event_desc_t *base = obj;

    memset(base, 0 , sizeof(*base));
- code style
- why needed
- rename base to event_desc
    UCT_CUDADRV_FUNC_LOG_WARN(cuStreamDestroy(*stream));
}

static void uct_cuda_base_event_desc_cleanup(ucs_mpool_t *mp, void *obj)
move it after uct_cuda_base_event_desc_init
    return UCS_OK;

err_free_ctx_rsc:
    ucs_free(ctx_rsc);
need to call destroy_rsc
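The requested error-path fix could look roughly like the sketch below: on a failure after the resource has been partially initialized, call the destroy routine before freeing the allocation, rather than calling ucs_free alone. All names here are simplified, hypothetical stand-ins for the UCX code.

```c
#include <stdlib.h>

/* simplified stand-in for the per-context resource */
typedef struct {
    int initialized;
} ctx_rsc_t;

/* stands in for the transport's destroy_rsc routine: releases
 * streams, events, and mpools acquired during init */
static void ctx_rsc_destroy(ctx_rsc_t *ctx_rsc)
{
    ctx_rsc->initialized = 0;
}

static int ctx_rsc_create(ctx_rsc_t **ctx_rsc_p, int fail_late)
{
    ctx_rsc_t *ctx_rsc = malloc(sizeof(*ctx_rsc));
    if (ctx_rsc == NULL) {
        return -1;
    }

    ctx_rsc->initialized = 1;    /* partial init succeeded */
    if (fail_late) {
        goto err_destroy_rsc;    /* a later init step failed */
    }

    *ctx_rsc_p = ctx_rsc;
    return 0;

err_destroy_rsc:
    ctx_rsc_destroy(ctx_rsc);    /* undo partial init before freeing */
    free(ctx_rsc);
    return -1;
}
```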
}

static UCS_F_ALWAYS_INLINE CUstream *
uct_cuda_ipc_get_stream(uct_cuda_ipc_ctx_rsc_t *ctx_rsc, int dev_num)
IMO this wrapper is redundant
key->dev_num %= iface->config.max_streams; /* round-robin */
q_desc  = &ctx_rsc->queue_desc[key->dev_num];
event_q = &q_desc->event_queue;
IMO the event_q variable is redundant.
ctx_rsc->queue_desc[i].stream = NULL;
ucs_queue_head_init(&ctx_rsc->queue_desc[i].event_queue);
Maybe add a helper function uct_cuda_base_queue_desc_init(), which could also be used in cuda_copy.
It can be placed right before uct_cuda_base_queue_desc_destroy in the code.
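The proposed helper could be sketched as below: a shared init routine used by both cuda_copy and cuda_ipc instead of open-coding the field initialization at every call site. The types here are simplified, hypothetical stand-ins for ucs_queue_head_t, CUstream, and the UCX queue-descriptor struct.

```c
#include <stddef.h>

typedef struct queue_elem {
    struct queue_elem *next;
} queue_elem_t;

/* stands in for ucs_queue_head_t */
typedef struct {
    queue_elem_t *head;
} queue_head_t;

/* simplified per-stream queue descriptor */
typedef struct {
    void         *stream;       /* stands in for CUstream */
    queue_head_t  event_queue;
} queue_desc_t;

static void queue_head_init(queue_head_t *q)
{
    q->head = NULL;
}

/* shared helper, analogous to the suggested
 * uct_cuda_base_queue_desc_init(): one place that sets every field */
static void queue_desc_init(queue_desc_t *q_desc)
{
    q_desc->stream = NULL;
    queue_head_init(&q_desc->event_queue);
}
```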
}

static void
uct_cuda_ipc_ctx_rsc_destroy(uct_iface_h tl_iface, uct_cuda_ctx_rsc_t *base)
Please don't name variables just "base", because it can mean different things in different contexts.
Here, it can be "cuda_ctx_rsc".
typedef struct {
    uct_cuda_event_desc_t super;
    void                 *mapped_addr;
    unsigned              stream_id;
I think it can be removed, since it was only used for stream_refcount, which was also removed.
What?
Original PR: #9654
Currently the CUDA_IPC transport uses an integer stream_refcount to track outstanding work, but in preparation for multi-device support, this PR moves to active_queue usage, similar to the cuda_copy transport. This will eventually also help unify more of the stream/event code shared between cuda_ipc and cuda_copy. This PR also removes the max-peer limitation.
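The change described above can be sketched as follows: instead of a single integer counting outstanding ops, each in-flight op is linked onto an active queue, so progress and flush can walk the actual outstanding work per stream. This is a minimal, self-contained illustration in the spirit of ucs_queue; the names and types are simplified stand-ins, not the UCX implementation.

```c
#include <stddef.h>

/* an outstanding operation, linkable onto the active queue */
typedef struct op {
    struct op *next;
} op_t;

/* FIFO queue of outstanding ops, replacing the integer counter */
typedef struct {
    op_t  *head;
    op_t **ptail;   /* points at the last next-pointer for O(1) push */
} active_queue_t;

static void active_queue_init(active_queue_t *q)
{
    q->head  = NULL;
    q->ptail = &q->head;
}

/* called when an op is posted to a stream */
static void active_queue_push(active_queue_t *q, op_t *op)
{
    op->next   = NULL;
    *q->ptail  = op;
    q->ptail   = &op->next;
}

/* called when the op's event completes */
static op_t *active_queue_pop(active_queue_t *q)
{
    op_t *op = q->head;
    if (op != NULL) {
        q->head = op->next;
        if (q->head == NULL) {
            q->ptail = &q->head;
        }
    }
    return op;
}

/* flush only needs to know whether anything is still outstanding */
static int active_queue_is_empty(const active_queue_t *q)
{
    return q->head == NULL;
}
```

Compared to a bare counter, the queue gives progress a handle on each outstanding op (e.g. to unmap its memhandle on completion) at the cost of one pointer per op.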