
Confusing arm_worker error during DAG failure in a UCX enabled Dask-CUDA cluster #832

Open
randerzander opened this issue Jan 20, 2022 · 0 comments


On a single-node dask-cuda cluster configured to use UCX (NVLink enabled, InfiniBand disabled), when a Dask DAG fails (likely due to OOM), the error message I receive is:

sys:1: RuntimeWarning: coroutine 'BlockingMode._arm_worker' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Task was destroyed but it is pending!
task: <Task cancelling name='Task-412494' coro=<BlockingMode._arm_worker() running at /home/rgelhausen/conda/envs/dsql-1-20/lib/python3.8/site-packages/ucp/continuous_ucx_progress.py:88>>
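
For anyone trying to read these messages: Python emits "was never awaited" when a coroutine object is dropped before anything ran it, and asyncio logs "Task was destroyed but it is pending!" when a task is garbage collected before it finishes. Both suggest shutdown noise rather than the root cause of the DAG failure, though that reading is my assumption and has not been confirmed against UCX-Py. A minimal sketch of the first message in plain Python, where `arm_worker` is a hypothetical stand-in and not the real `BlockingMode._arm_worker`:

```python
import asyncio
import gc
import warnings


async def arm_worker():
    # Hypothetical stand-in for BlockingMode._arm_worker; not UCX-Py code.
    await asyncio.sleep(0)


def collect_unawaited_warnings():
    """Create a coroutine, drop it without awaiting, and return the warning texts."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        coro = arm_worker()  # coroutine object created, but never scheduled
        del coro             # dropped before any event loop ran it
        gc.collect()         # make sure the finalizer has run
    return [str(w.message) for w in caught]


if __name__ == "__main__":
    for message in collect_unawaited_warnings():
        print(message)
```

The same kind of warning appears above for `_arm_worker`, which would be consistent with the worker's event loop being torn down while UCX progress tasks were still queued.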

Env details:

(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep ucx
ucx                       1.12.0+gd367332      cuda11.2_0    rapidsai-nightly
ucx-proc                  1.0.0                       gpu    rapidsai-nightly
ucx-py                    0.24.0a220120   py38_gd367332_26    rapidsai-nightly
(dsql-1-20) rgelhausen@rl-dgx2-r13-u7-rapids-dgx201:~/shared/gpu-bdb/gpu_bdb/cluster_configuration$ conda list | grep dask
dask                      2022.1.0+10.gc1c88f06          pypi_0    pypi
dask-cudf                 22.2.0a0+300.g12a0f596e5          pypi_0    pypi
dask-glm                  0.2.0                    pypi_0    pypi
dask-labextension         5.2.0              pyhd8ed1ab_0    conda-forge
dask-ml                   2021.11.31.dev2+g1e811ce4          pypi_0    pypi
dask-sql                  2021.12.1.dev34+g736f264          pypi_0    pypi
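
For reference, a cluster matching the description above (single node, UCX, NVLink enabled, InfiniBand disabled) could be started roughly as follows. This is only a sketch: it assumes `LocalCUDACluster` is used, whereas the actual cluster here may have been launched differently (for example with the `dask-cuda-worker` CLI), and the helper names `ucx_cluster_kwargs` and `start_cluster` are made up for illustration.

```python
def ucx_cluster_kwargs():
    """Keyword arguments for a single-node UCX cluster: NVLink on, InfiniBand off."""
    return {
        "protocol": "ucx",           # use UCX instead of the default TCP comms
        "enable_nvlink": True,       # NVLink enabled, as described in the issue
        "enable_infiniband": False,  # InfiniBand disabled, as described in the issue
    }


def start_cluster():
    # Imported lazily so the settings above can be inspected on a machine without GPUs.
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(**ucx_cluster_kwargs())
    return cluster, Client(cluster)
```

Calling `start_cluster()` requires GPUs plus the `dask-cuda` and `ucx-py` packages listed above.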
caryr35 added this to ucx-py on Jan 10, 2023