UCT/CUDA_IPC: Context switching aware resource management #10538

Open: iyastreb wants to merge 5 commits into master from uct/cuda_ipc/active-queue

Conversation

iyastreb (Contributor) commented on Mar 7, 2025:

What?

Original PR: #9654

Currently the CUDA_IPC transport uses an integer stream_count to track outstanding work. In preparation for multi-device support, this PR moves to an active_queue, similar to the cuda_copy transport. This will eventually also help unify more of the stream/event handling code shared between cuda_ipc and cuda_copy. This PR also removes the maximum-peers limitation.
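
For orientation, a minimal sketch of the layout this PR moves to, pieced together from the snippets quoted in the review below; the struct names and any fields not shown in those snippets are assumptions, not the actual UCX definitions:

    /* Sketch only (assumed context: ucs/datastruct/queue.h, cuda.h).
     * Each per-stream queue descriptor owns a FIFO of outstanding CUDA events.
     * Descriptors that currently have outstanding work are linked into the
     * interface-level active_queue, which progress and flush iterate instead
     * of tracking work with an integer stream_count. */
    typedef struct uct_cuda_ipc_queue_desc {
        CUstream         stream;      /* stream this descriptor serves */
        ucs_queue_head_t event_queue; /* outstanding event descriptors (FIFO) */
        ucs_queue_elem_t queue;       /* link in iface->active_queue */
    } uct_cuda_ipc_queue_desc_t;

    typedef struct uct_cuda_ipc_iface {
        ucs_queue_head_t active_queue; /* queue descriptors with pending events */
        /* ... event mpool, per-context resources, config, etc. ... */
    } uct_cuda_ipc_iface_t;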

Akshay-Venkatesh (Contributor) commented:

@iyastreb Are there any major design changes in this PR from #9654?

iyastreb (Contributor, Author) commented on Mar 7, 2025:

> @iyastreb Are there any major design changes in this PR from #9654?

There are no design changes at all, only polishing and minor enhancements.

iyastreb force-pushed the uct/cuda_ipc/active-queue branch from 39725eb to 64966ab on March 7, 2025 at 16:13.
    unsigned long stream_refcount[UCT_CUDA_IPC_MAX_PEERS];
    /* per stream outstanding ops */
    ucs_queue_head_t active_queue;
    khash_t(cuda_ipc_queue_desc) queue_desc_map;

Review comment (Contributor):

Do we need a separate cudaEvents mpool per hash element?

iyastreb changed the title from "UCT/CUDA_IPC: Use active-queues to track outstanding work" to "UCT/CUDA_IPC: Context switching aware resource management" on Mar 20, 2025.
iyastreb closed this on Mar 21, 2025.
iyastreb force-pushed the uct/cuda_ipc/active-queue branch from 92da0ff to c6bb534 on March 21, 2025 at 08:19.
iyastreb (Contributor, Author) commented:

reopen

iyastreb reopened this on Mar 21, 2025.

    ucs_queue_for_each_safe(q_desc, iter, &iface->active_queue, queue) {
        event_q = &q_desc->event_queue;
        count += iface->ops.progress_queue(tl_iface, event_q, max_events - count);

Review comment (Contributor):

uct_cuda_copy_progress_event_queue and uct_cuda_ipc_progress_event_queue have a lot in common: both walk the active queue and call cuEventQuery to check for completion. The only real difference is that cuda_ipc also calls uct_cuda_ipc_unmap_memhandle on completion. It would be better to pull that common code up into uct_cuda_base_iface_progress, which would also improve performance by avoiding an extra function call.
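
A rough sketch of the refactoring suggested above: the shared loop walks the active queue and polls cuEventQuery, while transport-specific cleanup (for cuda_ipc, unmapping the memhandle) is deferred to a completion hook. The hook name, the struct fields, and the exact signatures are assumptions for illustration, not the actual UCX code:

    /* Sketch: common progress loop hoisted into the base interface.
     * The real code would also bound the work by a max_events budget. */
    static unsigned uct_cuda_base_iface_progress(uct_iface_h tl_iface)
    {
        uct_cuda_iface_t      *iface = ucs_derived_of(tl_iface, uct_cuda_iface_t);
        uct_cuda_queue_desc_t *q_desc;
        uct_cuda_event_desc_t *event_desc;
        ucs_queue_iter_t      iter;
        unsigned              count = 0;

        ucs_queue_for_each_safe(q_desc, iter, &iface->active_queue, queue) {
            /* pop completed events from the head of this stream's event queue */
            while (!ucs_queue_is_empty(&q_desc->event_queue)) {
                event_desc = ucs_queue_head_elem_non_empty(&q_desc->event_queue,
                                                           uct_cuda_event_desc_t,
                                                           queue);
                if (cuEventQuery(event_desc->event) != CUDA_SUCCESS) {
                    break; /* events complete in order; stop at the first busy one */
                }

                ucs_queue_pull_non_empty(&q_desc->event_queue);
                /* transport-specific completion, e.g. cuda_ipc unmaps the
                 * memhandle here (hypothetical hook, not the real UCX API) */
                iface->ops.event_completed(iface, event_desc);
                ucs_mpool_put(event_desc);
                ++count;
            }

            /* no more outstanding work on this stream: drop it from active_queue */
            if (ucs_queue_is_empty(&q_desc->event_queue)) {
                ucs_queue_del_iter(&iface->active_queue, iter);
            }
        }

        return count;
    }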

Comment on lines +208 to +210:

    if (!ucs_queue_is_empty(&q_desc->event_queue)) {
        UCT_TL_IFACE_STAT_FLUSH_WAIT(ucs_derived_of(tl_iface,
                                                    uct_base_iface_t));

Review comment (Contributor):

There is probably no need to check that q_desc->event_queue is not empty; it is enough to check that iface->active_queue is not empty.
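
A minimal sketch of the flush check as suggested, assuming the flush path only needs to know whether any stream still has outstanding work on the interface (the surrounding function and return values are assumptions):

    /* Sketch: flush stays in progress while any queue descriptor is still
     * linked into active_queue, i.e. while any stream has outstanding events */
    if (!ucs_queue_is_empty(&iface->active_queue)) {
        UCT_TL_IFACE_STAT_FLUSH_WAIT(ucs_derived_of(tl_iface, uct_base_iface_t));
        return UCS_INPROGRESS;
    }

    UCT_TL_IFACE_STAT_FLUSH(ucs_derived_of(tl_iface, uct_base_iface_t));
    return UCS_OK;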

    {
        uct_cuda_event_desc_t *base = obj;

        memset(base, 0 , sizeof(*base));

Review comment (Contributor):

  1. Code style.
  2. Why is this needed?
  3. Rename base to event_desc.

        UCT_CUDADRV_FUNC_LOG_WARN(cuStreamDestroy(*stream));
    }

    static void uct_cuda_base_event_desc_cleanup(ucs_mpool_t *mp, void *obj)

Review comment (Contributor):

Move this function after uct_cuda_base_event_desc_init.

        return UCS_OK;

    err_free_ctx_rsc:
        ucs_free(ctx_rsc);

Review comment (Contributor):

The error path needs to call destroy_rsc before freeing ctx_rsc.
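
One possible shape of the fix, assuming the cleanup entry point is the uct_cuda_ipc_ctx_rsc_destroy function that appears later in this diff and that ctx_rsc embeds the base uct_cuda_ctx_rsc_t as "super"; the label names and the status variable are likewise assumptions:

        return UCS_OK;

    err_destroy_rsc:
        /* undo per-context setup (streams, queue descriptors, events)
         * before releasing the structure itself */
        uct_cuda_ipc_ctx_rsc_destroy(tl_iface, &ctx_rsc->super);
    err_free_ctx_rsc:
        ucs_free(ctx_rsc);
        return status;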

    }

    static UCS_F_ALWAYS_INLINE CUstream *
    uct_cuda_ipc_get_stream(uct_cuda_ipc_ctx_rsc_t *ctx_rsc, int dev_num)

Review comment (Contributor):

This wrapper seems redundant.

    key->dev_num %= iface->config.max_streams; /* round-robin */
    q_desc = &ctx_rsc->queue_desc[key->dev_num];
    event_q = &q_desc->event_queue;

Review comment (Contributor):

The event_q variable seems redundant.

Comment on lines +461 to +462:

    ctx_rsc->queue_desc[i].stream = NULL;
    ucs_queue_head_init(&ctx_rsc->queue_desc[i].event_queue);

Review comment (Contributor):

Maybe add a helper function uct_cuda_base_queue_desc_init() that cuda_copy can use as well; it could be placed right before uct_cuda_base_queue_desc_destroy in the code.
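
A minimal sketch of the suggested helper, assuming a queue descriptor carries just a stream handle and an event queue (the struct layout is inferred from the lines quoted in this review and is an assumption, not the exact UCX code):

    /* Sketch: shared initializer for a per-stream queue descriptor,
     * usable by both cuda_copy and cuda_ipc */
    static void uct_cuda_base_queue_desc_init(uct_cuda_queue_desc_t *q_desc)
    {
        q_desc->stream = NULL;                     /* created lazily on first use */
        ucs_queue_head_init(&q_desc->event_queue); /* no outstanding events yet */
    }

    /* usage in per-context resource creation (illustrative) */
    for (i = 0; i < iface->config.max_streams; ++i) {
        uct_cuda_base_queue_desc_init(&ctx_rsc->queue_desc[i]);
    }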

    }

    static void
    uct_cuda_ipc_ctx_rsc_destroy(uct_iface_h tl_iface, uct_cuda_ctx_rsc_t *base)

Review comment (Contributor):

Please don't name variables just "base", because what "base" refers to differs between contexts. Here it could be "cuda_ctx_rsc".

    typedef struct {
        uct_cuda_event_desc_t super;
        void *mapped_addr;
        unsigned stream_id;

Review comment (Contributor):

I think stream_id can be removed, since it was only used for stream_refcount, which has also been removed.
