UCT/CUDA_IPC: Context switching aware resource management #10538

Open: iyastreb wants to merge 5 commits into master from uct/cuda_ipc/active-queue

Conversation

iyastreb (Contributor) commented on Mar 7, 2025:

What?

Original PR: #9654

Currently the CUDA_IPC transport uses an integer stream_count to track outstanding work. In preparation for multi-device support, this PR moves to an active_queue, similar to the cuda_copy transport. This will eventually also help unify more of the stream/event handling code shared between cuda_ipc and cuda_copy. This PR also removes the maximum-peers limitation.
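
For orientation, a minimal sketch of the layout this PR moves to, pieced together from the snippets quoted in the review below; the struct names and any fields not shown in those snippets are assumptions, not the actual UCX definitions:

    /* Sketch only (assumed context: ucs/datastruct/queue.h, cuda.h).
     * Each per-stream queue descriptor owns a FIFO of outstanding CUDA events.
     * Descriptors that currently have outstanding work are linked into the
     * interface-level active_queue, which progress and flush iterate instead
     * of tracking work with an integer stream_count. */
    typedef struct uct_cuda_ipc_queue_desc {
        CUstream         stream;      /* stream this descriptor serves */
        ucs_queue_head_t event_queue; /* outstanding event descriptors (FIFO) */
        ucs_queue_elem_t queue;       /* link in iface->active_queue */
    } uct_cuda_ipc_queue_desc_t;

    typedef struct uct_cuda_ipc_iface {
        ucs_queue_head_t active_queue; /* queue descriptors with pending events */
        /* ... event mpool, per-context resources, config, etc. ... */
    } uct_cuda_ipc_iface_t;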

Akshay-Venkatesh (Contributor) commented:

@iyastreb Are there any major design changes in this PR from #9654?

iyastreb (Contributor, Author) commented on Mar 7, 2025:

> @iyastreb Are there any major design changes in this PR from #9654?

There are no design changes at all, only polishing and minor enhancements.

iyastreb force-pushed the uct/cuda_ipc/active-queue branch from 39725eb to 64966ab on March 7, 2025 at 16:13.
    unsigned long stream_refcount[UCT_CUDA_IPC_MAX_PEERS];
    /* per stream outstanding ops */
    ucs_queue_head_t active_queue;
    khash_t(cuda_ipc_queue_desc) queue_desc_map;

Review comment (Contributor):

Do we need a separate cudaEvents mpool per hash element?

iyastreb changed the title from "UCT/CUDA_IPC: Use active-queues to track outstanding work" to "UCT/CUDA_IPC: Context switching aware resource management" on Mar 20, 2025.
iyastreb closed this on Mar 21, 2025.
iyastreb force-pushed the uct/cuda_ipc/active-queue branch from 92da0ff to c6bb534 on March 21, 2025 at 08:19.
iyastreb (Contributor, Author) commented:

reopen

iyastreb reopened this on Mar 21, 2025.

    ucs_queue_for_each_safe(q_desc, iter, &iface->active_queue, queue) {
        event_q = &q_desc->event_queue;
        count += iface->ops.progress_queue(tl_iface, event_q, max_events - count);

Review comment (Contributor):

uct_cuda_copy_progress_event_queue and uct_cuda_ipc_progress_event_queue have a lot in common: both walk the active queue and call cuEventQuery to check for completion. The only real difference is that cuda_ipc also calls uct_cuda_ipc_unmap_memhandle on completion. It would be better to pull that common code up into uct_cuda_base_iface_progress, which would also improve performance by avoiding an extra function call.
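
A rough sketch of the refactoring suggested above: the shared loop walks the active queue and polls cuEventQuery, while transport-specific cleanup (for cuda_ipc, unmapping the memhandle) is deferred to a completion hook. The hook name, the struct fields, and the exact signatures are assumptions for illustration, not the actual UCX code:

    /* Sketch: common progress loop hoisted into the base interface.
     * The real code would also bound the work by a max_events budget. */
    static unsigned uct_cuda_base_iface_progress(uct_iface_h tl_iface)
    {
        uct_cuda_iface_t      *iface = ucs_derived_of(tl_iface, uct_cuda_iface_t);
        uct_cuda_queue_desc_t *q_desc;
        uct_cuda_event_desc_t *event_desc;
        ucs_queue_iter_t      iter;
        unsigned              count = 0;

        ucs_queue_for_each_safe(q_desc, iter, &iface->active_queue, queue) {
            /* pop completed events from the head of this stream's event queue */
            while (!ucs_queue_is_empty(&q_desc->event_queue)) {
                event_desc = ucs_queue_head_elem_non_empty(&q_desc->event_queue,
                                                           uct_cuda_event_desc_t,
                                                           queue);
                if (cuEventQuery(event_desc->event) != CUDA_SUCCESS) {
                    break; /* events complete in order; stop at the first busy one */
                }

                ucs_queue_pull_non_empty(&q_desc->event_queue);
                /* transport-specific completion, e.g. cuda_ipc unmaps the
                 * memhandle here (hypothetical hook, not the real UCX API) */
                iface->ops.event_completed(iface, event_desc);
                ucs_mpool_put(event_desc);
                ++count;
            }

            /* no more outstanding work on this stream: drop it from active_queue */
            if (ucs_queue_is_empty(&q_desc->event_queue)) {
                ucs_queue_del_iter(&iface->active_queue, iter);
            }
        }

        return count;
    }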

Comment on lines +208 to +210:

    if (!ucs_queue_is_empty(&q_desc->event_queue)) {
        UCT_TL_IFACE_STAT_FLUSH_WAIT(ucs_derived_of(tl_iface,
                                                    uct_base_iface_t));

Review comment (Contributor):

There is probably no need to check that q_desc->event_queue is not empty; it is enough to check that iface->active_queue is not empty.
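
A minimal sketch of the flush check as suggested, assuming the flush path only needs to know whether any stream still has outstanding work on the interface (the surrounding function and return values are assumptions):

    /* Sketch: flush stays in progress while any queue descriptor is still
     * linked into active_queue, i.e. while any stream has outstanding events */
    if (!ucs_queue_is_empty(&iface->active_queue)) {
        UCT_TL_IFACE_STAT_FLUSH_WAIT(ucs_derived_of(tl_iface, uct_base_iface_t));
        return UCS_INPROGRESS;
    }

    UCT_TL_IFACE_STAT_FLUSH(ucs_derived_of(tl_iface, uct_base_iface_t));
    return UCS_OK;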

    {
        uct_cuda_event_desc_t *base = obj;

        memset(base, 0 , sizeof(*base));

Review comment (Contributor):

  1. Code style.
  2. Why is this needed?
  3. Rename base to event_desc.

        UCT_CUDADRV_FUNC_LOG_WARN(cuStreamDestroy(*stream));
    }

    static void uct_cuda_base_event_desc_cleanup(ucs_mpool_t *mp, void *obj)

Review comment (Contributor):

Move this function after uct_cuda_base_event_desc_init.

        return UCS_OK;

    err_free_ctx_rsc:
        ucs_free(ctx_rsc);

Review comment (Contributor):

The error path needs to call destroy_rsc before freeing ctx_rsc.
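
One possible shape of the fix, assuming the cleanup entry point is the uct_cuda_ipc_ctx_rsc_destroy function that appears later in this diff and that ctx_rsc embeds the base uct_cuda_ctx_rsc_t as "super"; the label names and the status variable are likewise assumptions:

        return UCS_OK;

    err_destroy_rsc:
        /* undo per-context setup (streams, queue descriptors, events)
         * before releasing the structure itself */
        uct_cuda_ipc_ctx_rsc_destroy(tl_iface, &ctx_rsc->super);
    err_free_ctx_rsc:
        ucs_free(ctx_rsc);
        return status;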

    }

    static UCS_F_ALWAYS_INLINE CUstream *
    uct_cuda_ipc_get_stream(uct_cuda_ipc_ctx_rsc_t *ctx_rsc, int dev_num)

Review comment (Contributor):

This wrapper seems redundant.

    key->dev_num %= iface->config.max_streams; /* round-robin */
    q_desc = &ctx_rsc->queue_desc[key->dev_num];
    event_q = &q_desc->event_queue;

Review comment (Contributor):

The event_q variable seems redundant.

Comment on lines +461 to +462:

    ctx_rsc->queue_desc[i].stream = NULL;
    ucs_queue_head_init(&ctx_rsc->queue_desc[i].event_queue);

Review comment (Contributor):

Maybe add a helper function uct_cuda_base_queue_desc_init() that cuda_copy can use as well; it could be placed right before uct_cuda_base_queue_desc_destroy in the code.
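
A minimal sketch of the suggested helper, assuming a queue descriptor carries just a stream handle and an event queue (the struct layout is inferred from the lines quoted in this review and is an assumption, not the exact UCX code):

    /* Sketch: shared initializer for a per-stream queue descriptor,
     * usable by both cuda_copy and cuda_ipc */
    static void uct_cuda_base_queue_desc_init(uct_cuda_queue_desc_t *q_desc)
    {
        q_desc->stream = NULL;                     /* created lazily on first use */
        ucs_queue_head_init(&q_desc->event_queue); /* no outstanding events yet */
    }

    /* usage in per-context resource creation (illustrative) */
    for (i = 0; i < iface->config.max_streams; ++i) {
        uct_cuda_base_queue_desc_init(&ctx_rsc->queue_desc[i]);
    }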

    }

    static void
    uct_cuda_ipc_ctx_rsc_destroy(uct_iface_h tl_iface, uct_cuda_ctx_rsc_t *base)

Review comment (Contributor):

Please don't name variables just "base", because what "base" refers to differs between contexts. Here it could be "cuda_ctx_rsc".

    typedef struct {
        uct_cuda_event_desc_t super;
        void *mapped_addr;
        unsigned stream_id;

Review comment (Contributor):

I think stream_id can be removed, since it was only used for stream_refcount, which has also been removed.
