Describe the bug
We have had multiple breakages where the CUDA context ends up only on GPU 0 in a Dask + PyTorch environment. This can happen when a library creates a CUDA context through PyTorch before the cluster is started. What ends up happening is that PyTorch models get deployed on GPU 0 only, and that issue is hard to debug.
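For illustration only, here is a minimal sketch of the failure mode (the tensor and the exact call are hypothetical, not taken from our stack): any CUDA call made in the parent process before LocalCUDACluster starts creates a context there, and subsequent work tends to land on GPU 0 instead of being spread across the workers' GPUs.

import torch
from dask_cuda import LocalCUDACluster
from distributed import Client

# A CUDA-enabled library (here PyTorch, hypothetically) touches the GPU at import
# time or in module-level code, creating a context on GPU 0 in the parent process.
t = torch.as_tensor([1, 2, 3], device="cuda")

# The cluster is started afterwards; models deployed through it tend to pile up on GPU 0.
cluster = LocalCUDACluster()
client = Client(cluster)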
I think a better fix is ensuring we don't fork the context if it's already present for LocalCUDACluster.
import cupy as cp

cp.cuda.runtime.getDeviceCount()
# import torch
# t = torch.as_tensor([1, 2, 3])
from dask_cuda import LocalCUDACluster
from distributed import Client
from distributed.diagnostics.nvml import has_cuda_context
import time

def check_cuda_context():
    _warning_suffix = (
        "This is often the result of a CUDA-enabled library calling a CUDA runtime function before "
        "Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen "
        "at import time or in the global scope of a program."
    )
    if has_cuda_context().has_context:
        # A CUDA context already exists in this process, so fail before starting the cluster
        raise RuntimeError(
            f"CUDA context was initialized before the Dask-CUDA cluster was spun up. {_warning_suffix}"
        )

if __name__ == "__main__":
    check_cuda_context()
    cluster = LocalCUDACluster(rmm_async=True, rmm_pool_size="2GiB")
    client = Client(cluster)
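As a usage sketch building on check_cuda_context above (the allocate_on_worker helper and the tensor it creates are illustrative, not part of the issue): the safe ordering is to fail fast if a context already exists, start the cluster, and only then do CUDA work inside tasks, so each worker creates its own context on its assigned GPU.

import os

import torch
from dask_cuda import LocalCUDACluster
from distributed import Client


def allocate_on_worker():
    # Runs inside a worker process; Dask-CUDA has already set CUDA_VISIBLE_DEVICES,
    # so this creates a context on the worker's assigned GPU rather than GPU 0.
    torch.as_tensor([1, 2, 3], device="cuda")
    return os.environ.get("CUDA_VISIBLE_DEVICES")


if __name__ == "__main__":
    check_cuda_context()  # defined in the snippet above; fails fast if a context exists
    cluster = LocalCUDACluster(rmm_async=True, rmm_pool_size="2GiB")
    client = Client(cluster)
    # Each worker reports its own GPU ordering, confirming per-worker contexts.
    print(client.run(allocate_on_worker))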