Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/CUDA: Use sys_dev to allocate CUDA memory #10553

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

tvegas1
Copy link
Contributor

@tvegas1 tvegas1 commented Mar 14, 2025

What?

Re-adds existing API PR #10448 with latest comments addressed, and UCT implementation.

Why?

Fragments used for pipeline need to use the same GPU as the user GPU.

How?

Set context if needed, if not available find first usable primary context. If explicit context was requested and not found, return an error.

@tvegas1
Copy link
Contributor Author

tvegas1 commented Mar 14, 2025

@rakhmets, we can probably eventually reuse here the ctx retain cache from #10545 when it's merged?

@rakhmets
Copy link
Contributor

@rakhmets, we can probably eventually reuse here the ctx retain cache from #10545 when it's merged?

Yes, exactly.

return UCS_OK;
}

status = UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRetain(&primary_ctx, device),
Copy link
Contributor

@rakhmets rakhmets Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to check the primary device context using cuDevicePrimaryCtxGetState. The main use case that should be covered when user allocates memory on the device I, then switch to device J, and we should allocate staging buffer on device I. Thus, if cuDevicePrimaryCtxGetState returns inactive state for the given device, then we should report an error instead of retaining and activating primary context on the device without active primary context.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added function, and log

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rakhmets why is it considered an error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i assume this code will be modified to use common func for context retainig, so no need to review it now, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the two function above and using uct_cuda_primary_ctx_retain()

@@ -61,6 +61,34 @@ void uct_cuda_base_get_sys_dev(CUdevice cuda_device,
*sys_dev_p = UCS_SYS_DEVICE_ID_UNKNOWN;
}

static CUdevice sys_dev_to_device[UCS_SYS_DEVICE_ID_MAX] = {
[0 ... UCS_SYS_DEVICE_ID_MAX - 1] = CU_DEVICE_INVALID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it gcc extension, but not a C99?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


class test_switch_cuda_device : public test_md {
public:
protected:
int num_devices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor:

Suggested change
int num_devices;
int m_num_devices;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

};

template<class T>
void test_switch_cuda_device::detect_mem_type(ucs_memory_type_t mem_type) const
{
int num_devices;
ASSERT_EQ(cudaGetDeviceCount(&num_devices), cudaSuccess);
int current_device;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to move

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is now done as part of init(), unless I miss something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant no need to move initialization of int current_device;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -90,6 +118,8 @@ uct_cuda_base_query_md_resources(uct_component_t *component,
continue;
}

sys_dev_to_device[sys_dev] = cuda_device;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we populate this array in uct_cuda_base_get_sys_dev instead?
it is not expected that user will query md tl resources for every cudaDevice visible for the thread/process

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user does not need to query all devices, but indeed, does need to query at least one device to trigger this cache population.

Moved to uct_cuda_base_get_sys_dev() for now but since it adds if(){} check, maybe we could move it to ctor function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to have it here then

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -1712,6 +1712,7 @@ typedef struct uct_allocated_memory {
ucs_memory_type_t mem_type; /**< type of allocated memory */
uct_md_h md; /**< if method==MD: MD used to allocate the memory */
uct_mem_h memh; /**< if method==MD: MD memory handle */
ucs_sys_device_t sys_dev; /**< System device for allocated memory */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, seems like we use sys_device more often in the APIs, maybe better to use it here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -90,6 +118,8 @@ uct_cuda_base_query_md_resources(uct_component_t *component,
continue;
}

sys_dev_to_device[sys_dev] = cuda_device;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok to have it here then

};

template<class T>
void test_switch_cuda_device::detect_mem_type(ucs_memory_type_t mem_type) const
{
int num_devices;
ASSERT_EQ(cudaGetDeviceCount(&num_devices), cudaSuccess);
int current_device;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant no need to move initialization of int current_device;

@tvegas1
Copy link
Contributor Author

tvegas1 commented Mar 17, 2025

Repro failures with container nvidia/cuda:11.8.0-devel-ubuntu20.04 and nvidia-utils-555, some comments remain to be addressed.

return UCS_OK;
}

status = UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRetain(&primary_ctx, device),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rakhmets why is it considered an error?

@tvegas1
Copy link
Contributor Author

tvegas1 commented Mar 19, 2025

all comments addressed, skip inactive disabled for now to see if any failure.

Comment on lines 188 to 189
if ((mem_type != UCS_MEMORY_TYPE_HOST) ||
(sys_dev != UCS_SYS_DEVICE_ID_UNKNOWN)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add helper inline function, smth like
uct_mem_is_host_alloc(params)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 351 to 356
status = UCT_CUDADRV_FUNC(cuDevicePrimaryCtxGetState(device, &flags,
&active),
log_level);
if (status != UCS_OK) {
return status;
} else if (!active) {
} else if (!active && md->config.skip_inactive_primary_ctx) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to rebase on Raul's PR for VMM detection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@yosefe
Copy link
Contributor

yosefe commented Mar 19, 2025

==== Running ucx_info env vars test ====
+ UCX_IB_PORTS=mlx5_0:1
+ ./src/tools/info/ucx_info -epw -u t
+ grep -q -E UCX_IB_PORTS
+ grep unused
[swx-rdmz-ucx-gpu-02:64825:0:64825]      ucp_mm.c:1756 Assertion `mem_info.type == alloc_mem_type' failed: mem_info.mem_type=host alloc_mem_type=cuda
==== backtrace (tid:  64825) ====
 0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f216352e47c]
 1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f216352b4f0]

@tvegas1
Copy link
Contributor Author

tvegas1 commented Mar 19, 2025

==== Running ucx_info env vars test ====
+ UCX_IB_PORTS=mlx5_0:1
+ ./src/tools/info/ucx_info -epw -u t
+ grep -q -E UCX_IB_PORTS
+ grep unused
[swx-rdmz-ucx-gpu-02:64825:0:64825]      ucp_mm.c:1756 Assertion `mem_info.type == alloc_mem_type' failed: mem_info.mem_type=host alloc_mem_type=cuda
==== backtrace (tid:  64825) ====
 0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x12c) [0x7f216352e47c]
 1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f216352b4f0]

happening because of retaining a non-active apparently, will investigate.

return UCS_OK;
}

ucs_status_t ucs_topo_sys_dev_set_user_value(ucs_sys_device_t sys_dev,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove

Comment on lines 727 to 732
size_t length = ucs_static_array_size(ucs_topo_global_ctx.devices);
ucs_topo_sys_device_info_t *device;

ucs_carray_for_each(device, ucs_topo_global_ctx.devices, length) {
device->user_value = UINTPTR_MAX;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed? we do not pre-initialize other fields

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

* Find system device by pci bus id.
*
* @param [in] bus_id pointer to bus id of the device of interest.
* @param [in] user_value user_value to add along
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param [in] user_value user_value to add along
* @param [in] user_value user_value to add along.

for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +68 to +69
return (mem_type == UCS_MEMORY_TYPE_HOST) &&
(sys_dev == UCS_SYS_DEVICE_ID_UNKNOWN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need it? I'd just ignore sys_dev if mem_type is host

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems it's not strictly needed. @yosefe, i assume this is for future proof?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, IMO better keep it for strict behavior

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is strict behavior? With this change we do not treat host memory as host anymore if sys_device is set. Then how should we handle such case? Imo, better to ignore sys_dev when memory type is host

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if sys_device is set, we cannot do a host memory allocation because it would not match the system device requirement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we can return invalid param then. What will be the behavior with this change?

@@ -74,7 +94,7 @@ uct_cuda_base_query_md_resources(uct_component_t *component,
int i, num_gpus;

status = UCT_CUDADRV_FUNC(cuDeviceGetCount(&num_gpus), UCS_LOG_LEVEL_DIAG);
if ((status != UCS_OK) || (num_gpus == 0)) {
if (status != UCS_OK) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to remove check for num_gpus?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed thanks!

Comment on lines +68 to +69
return (mem_type == UCS_MEMORY_TYPE_HOST) &&
(sys_dev == UCS_SYS_DEVICE_ID_UNKNOWN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, IMO better keep it for strict behavior

"Do not use CUDA primary context if not active when allocating memory.\n",
ucs_offsetof(uct_cuda_copy_md_config_t, skip_inactive_primary_ctx),
{"RETAIN_INACTIVE_PRIMARY_CTX", "y",
"Also use inactive CUDA primary context for memory allocation\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable retaining and using an inactive CUDA primary context for memory allocation
remote trailing \n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

return UCS_OK;
if (*cu_device != CU_DEVICE_INVALID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment before:
/* Use the active cuda device */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added line before if(){}

Comment on lines +213 to +349
for (auto device = 0; device < m_num_devices; ++device) {
ASSERT_EQ(cudaSetDevice(device), cudaSuccess);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you pls add a comment to explain it?

@tvegas1
Copy link
Contributor Author

tvegas1 commented Mar 20, 2025

addressed comments and merged with common retain API, will push here when #10545 is in.

@@ -79,7 +79,8 @@ static ucs_config_field_t uct_cuda_copy_md_config_table[] = {
UCS_CONFIG_TYPE_ENUM(ucs_memory_type_names)},

{"RETAIN_INACTIVE_PRIMARY_CTX", "y",
"Also use inactive CUDA primary context for memory allocation\n",
"Enable retaining and using an inactive CUDA primary context for memory"
" allocation\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove trailing \n

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (to review merge fix)

@@ -78,6 +78,12 @@ static ucs_config_field_t uct_cuda_copy_md_config_table[] = {
ucs_offsetof(uct_cuda_copy_md_config_t, cuda_async_mem_type),
UCS_CONFIG_TYPE_ENUM(ucs_memory_type_names)},

{"RETAIN_INACTIVE_PRIMARY_CTX", "n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set to no to see CI pass for now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this variable to
RETAIN_INACTIVE_PRIMARY_CTX
retain_primary_ctx

to make code more compact
@rakhmets WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest renaming to retain_primary_ctx

*
* @return UCS_OK or error in case device cannot be found.
*/
ucs_status_t ucs_topo_find_device_by_bus_id_user_value(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe introduce void ucs_topo_sys_dev_set_user_value(ucs_sys_device_t sys_dev, uintptr_t user_value), and set it to UINTPTR_MAX in ucs_topo_find_device_by_bus_id.
Instead of having two similar methods ucs_topo_find_device_by_bus_id and ucs_topo_find_device_by_bus_id_user_value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started as you suggested and changed to address code reviews. @yosefe what do we do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

*device = user_value;
if (*device == CU_DEVICE_INVALID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an unexpected case.
user_value is set only if all previous methods related to the device have been successfully completed.
Maybe assert is better for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 82 to 83
"Enable retaining and using an inactive CUDA primary context for memory"
" allocation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enable seems to be redundant.
Retaining and use an inactive CUDA primary context for memory allocation

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so: "Retain and use an inactive CUDA primary context for memory allocation"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -405,7 +522,9 @@ uct_cuda_copy_mem_alloc(uct_md_h uct_md, size_t *length_p, void **address_p,
*memh_p = alloc_handle;
*address_p = (void*)alloc_handle->ptr;
*length_p = alloc_handle->length;
return UCS_OK;
out:
uct_cuda_copy_pop_ctx(cu_device, alloc_cu_device, UCS_LOG_LEVEL_ERROR);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WARN should be ok here. Since UCX can continue running.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 358 to 366
status = UCT_CUDADRV_FUNC(cuCtxPopCurrent(&ctx), log_level);
if (status != UCS_OK) {
return;
}

status = UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRelease(device), log_level);
if (status != UCS_OK) {
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
status = UCT_CUDADRV_FUNC(cuCtxPopCurrent(&ctx), log_level);
if (status != UCS_OK) {
return;
}
status = UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRelease(device), log_level);
if (status != UCS_OK) {
return;
}
UCT_CUDADRV_FUNC(cuCtxPopCurrent(NULL), log_level);
UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRelease(device), log_level);

No need to exit if pop failed. We may warn user about the error, and it is better to try release the primary context on the device anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

ucs_status_t status;
int dev, num_devices;

status = UCT_CUDADRV_FUNC(cuCtxGetDevice(cu_device_p), UCS_LOG_LEVEL_DEBUG);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
status = UCT_CUDADRV_FUNC(cuCtxGetDevice(cu_device_p), UCS_LOG_LEVEL_DEBUG);
status = UCT_CUDADRV_FUNC_LOG_DEBUG(cuCtxGetDevice(cu_device_p));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 415 to 428
status = UCT_CUDADRV_FUNC(cuDeviceGetCount(&num_devices),
UCS_LOG_LEVEL_DIAG);
if (status != UCS_OK) {
return UCS_ERR_INVALID_PARAM;
}

/* Use the first active cuda device for allocation */
for (dev = 0; dev < num_devices; dev++) {
if (UCT_CUDADRV_FUNC_LOG_DEBUG(cuDeviceGet(alloc_cu_device_p, dev)) !=
UCS_OK) {
continue;
}

status = uct_cuda_copy_push_ctx(md, *alloc_cu_device_p, log_level);
if (status == UCS_OK) {
break;
}
}

if (status != UCS_OK) {
ucs_log(log_level,
"no active cuda primary context for memory allocation");
}

return status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this combination of methods from #10564 work for you?

static ucs_status_t
uct_cuda_primary_ctx_push_first_active(CUdevice *cuda_device_p,
                                       CUcontext *cuda_ctx_p)
{
    int num_devices, device_num;
    ucs_status_t status;
    CUdevice device;
    CUcontext ctx;

    status = UCT_CUDADRV_FUNC_LOG_ERR(cuDeviceGetCount(&num_devices));
    if (status != UCS_OK) {
        return status;
    }

    for (device_num = 0; device_num < num_devices; ++device_num) {
        status = UCT_CUDADRV_FUNC_LOG_ERR(cuDeviceGet(&device, device_num));
        if (status != UCS_OK) {
            return status;
        }

        status = uct_cuda_primary_ctx_retain(device, 0, &ctx);
        if (status == UCS_OK) {
            break;
        } else if (status == UCS_ERR_NO_DEVICE) {
            continue;
        } else {
            return status;
        }
    }

    if (device_num == num_devices) {
        return UCS_ERR_NO_DEVICE;
    }

    status = UCT_CUDADRV_FUNC_LOG_ERR(cuCtxPushCurrent(ctx));
    if (status != UCS_OK) {
        UCT_CUDADRV_FUNC_LOG_WARN(cuDevicePrimaryCtxRelease(device));
        return status;
    }

    *cuda_device_p = device;
    *cuda_ctx_p    = ctx;
    return UCS_OK;
}

static UCS_F_ALWAYS_INLINE void
uct_cuda_primary_ctx_pop_and_release(CUdevice cuda_device)
{
    if (ucs_likely(cuda_device == CU_DEVICE_INVALID)) {
        return;
    }

    UCT_CUDADRV_FUNC_LOG_WARN(cuCtxPopCurrent(NULL));
    UCT_CUDADRV_FUNC_LOG_WARN(cuDevicePrimaryCtxRelease(cuda_device));
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes agree, maybe we can merge this one and #10564 could extract the block? I understand this one might block other PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep both independent and after merge we unite the code

Comment on lines +68 to +69
return (mem_type == UCS_MEMORY_TYPE_HOST) &&
(sys_dev == UCS_SYS_DEVICE_ID_UNKNOWN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if sys_device is set, we cannot do a host memory allocation because it would not match the system device requirement.

Comment on lines 82 to 83
"Enable retaining and using an inactive CUDA primary context for memory"
" allocation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so: "Retain and use an inactive CUDA primary context for memory allocation"

Comment on lines +388 to +400
UCS_TEST_P(test_mem_alloc_device, different_device_cuda_fabric,
"CUDA_COPY_ENABLE_FABRIC=y")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add CUDA_COPY_ENABLE_FABRIC=y as a test variant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did not find existing infra in uct

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see

enum_resources(const std::string &tl_name)

Comment on lines 518 to 520
(void)UCT_CUDADRV_FUNC(cuCtxPopCurrent(&ctx), UCS_LOG_LEVEL_WARN);
(void)UCT_CUDADRV_FUNC(cuDevicePrimaryCtxRelease(alloc_cu_device),
UCS_LOG_LEVEL_WARN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it was better when these 2 calls were in a helper function uct_cuda_copy_pop_ctx because it is the opposite of uct_cuda_copy_push_ctx
then reuse uct_cuda_copy_pop_ctx also in uct_cuda_copy_md_sync_memops_get_address_range for consistency

Comment on lines 670 to 680
if (uct_cuda_copy_push_ctx(cuda_device, 0, UCS_LOG_LEVEL_ERROR) !=
UCS_OK) {
return;
}
} else {
cuda_mem_ctx = cuda_ctx;
}

if (UCT_CUDADRV_FUNC_LOG_ERR(cuCtxPushCurrent(cuda_mem_ctx)) != UCS_OK) {
goto out_primary_ctx_release;
} else if (UCT_CUDADRV_FUNC_LOG_ERR(cuCtxPushCurrent(cuda_ctx)) !=
UCS_OK) {
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (cuda_ctx == NULL) {
    status = uct_cuda_copy_push_ctx(cuda_device, 0, UCS_LOG_LEVEL_ERROR);
} else {
    status = UCT_CUDADRV_FUNC_LOG_ERR(cuCtxPushCurrent(cuda_ctx));
}
if (status !=UCS_OK) {        
    return;
}

Comment on lines +388 to +400
UCS_TEST_P(test_mem_alloc_device, different_device_cuda_fabric,
"CUDA_COPY_ENABLE_FABRIC=y")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see

enum_resources(const std::string &tl_name)

@@ -78,6 +78,12 @@ static ucs_config_field_t uct_cuda_copy_md_config_table[] = {
ucs_offsetof(uct_cuda_copy_md_config_t, cuda_async_mem_type),
UCS_CONFIG_TYPE_ENUM(ucs_memory_type_names)},

{"RETAIN_INACTIVE_PRIMARY_CTX", "n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest renaming to retain_primary_ctx

brminich
brminich previously approved these changes Mar 23, 2025
{
CUcontext ctx;

(void)UCT_CUDADRV_FUNC(cuCtxPopCurrent(&ctx), UCS_LOG_LEVEL_WARN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor, can pass NULL instead of ctx

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants