This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

[bug] Dask worker dies during dask-xgboost classifier training : test_core.py::test_classifier #68

Open
pradghos opened this issue Jan 30, 2020 · 5 comments

Comments

@pradghos

Dask worker dies during dask-xgboost classifier training; this is observed while running test_core.py::test_classifier.

Configuration used -

Dask Version: 2.9.2
Distributed Version: 2.9.3
XGBoost Version: 0.90
Dask-XGBoost Version: 0.1.9
OS-release : 4.14.0-115.16.1.el7a.ppc64le

Description / Steps -

  1. The test creates a cluster with two workers (a sketch follows the startup log below) -
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45767
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40743
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40743
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-c6ea91c7-746e-4c7a-9c13-f5afcd244966/worker-ebbqtfdu
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33373
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33373
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-050815d2-54f6-4edc-9a03-dd075213449d/worker-i1yr8xvc
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:40743
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:33373', name: tcp://127.0.0.1:33373, memory: 0, processing: 0>
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:33373
distributed.core - INFO - Starting established connection
distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45767
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
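For reference, the two-worker setup above can be reproduced outside of pytest with a sketch like the following, using distributed's test utilities (illustrative only; everything beyond the cluster()/Client calls shown in the test itself is an assumption):

from distributed import Client
from distributed.utils_test import cluster

# Start a throwaway scheduler plus two in-process workers, the same shape the test uses.
with cluster() as (s, [a, b]):  # s: scheduler info dict, a/b: worker info dicts
    with Client(s["address"]) as client:
        # Both workers should be registered before training starts.
        print(client.scheduler_info()["workers"])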

  2. After a couple of steps, fit is called for dask-xgboost -
-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
n
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2651937, 'stop': 1580372953.265216, 'thread': 140735736705456, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2696354, 'stop': 1580372953.2696435, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2705007, 'stop': 1580372953.2705073, 'thread': 140735736705456, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2753158, 'stop': 1580372953.275466, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2762377, 'stop': 1580372953.2763371, 'thread': 140735736705456, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2805014, 'stop': 1580372953.2805073, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580372953.2813187, 'stop': 1580372953.2813244, 'thread': 140735736705456, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys

Dask worker dies -

distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:40743
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:40743', name: tcp://127.0.0.1:40743, memory: 1, processing: 1>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:40743     ===========================>>> One worker dies 
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
distributed.worker - DEBUG - Execute key: train_part-e17e49e3769aaa4870dc8cc01a1e015e worker: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING   ===  The surviving worker keeps running indefinitely
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - future state: train_part-e17e49e3769aaa4870dc8cc01a1e015e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33373

It is not clear why the Dask worker dies at that point.

Thanks!
Pradipta

@pradghos pradghos changed the title [Bug] Dask worker dies while during dask-xgboost classifier training : test_core.py::test_classifier [bug] Dask worker dies while during dask-xgboost classifier training : test_core.py::test_classifier Jan 30, 2020
@pradghos
Author

If I remove the sparse package (coming from conda-forge) from my environment, the Dask workers work fine and are able to finish the task instead of dying.

removing sparse conda package -

conda remove sparse
Collecting package metadata (repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.1

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /mnt/pai/home/pradghos/anaconda3/envs/gdf37

  removed specs:
    - sparse


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.11.28         |           py37_0         156 KB
    ------------------------------------------------------------
                                           Total:         156 KB

The following packages will be REMOVED:

  llvmlite-0.31.0-py37hd408876_0
  numba-0.47.0-py37h962f231_0
  sparse-0.9.1-py_0
  tbb-2019.9-h1bb5118_1

The following packages will be UPDATED:

  openssl            conda-forge::openssl-1.1.1d-h6eb9509_0 --> pkgs/main::openssl-1.1.1d-h7b6447c_3

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi                                       conda-forge --> pkgs/main


Proceed ([y]/n)? y


Downloading and Extracting Packages
certifi-2019.11.28   | 156 KB    | ################################################################################################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Then the success log -

(gdf37) [pradghos@dlw11 tests]$ pytest --trace -v test_core.py::test_classifier
==================================================================== test session starts =====================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.1, pluggy-0.13.1 -- /mnt/pai/home/pradghos/anaconda3/envs/gdf37/bin/python
cachedir: .pytest_cache
rootdir: /mnt/pai/home/pradghos/dask-xgboost, inifile: setup.cfg
collected 1 item

test_core.py::test_classifier
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB runcall (IO-capturing turned off) >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
> /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/test_core.py(38)test_classifier()
-> with cluster() as (s, [a, b]):
(Pdb) n
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:46179
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:34459
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:34459
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:46179
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-071bff45-4e7d-4cb5-ae4d-5d77ec15ef20/worker-ozjlqw1m
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33495
distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33495
distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:46179
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                  612.37 GB
distributed.worker - INFO -       Local Directory: /mnt/pai/home/pradghos/dask-xgboost/dask_xgboost/tests/_test_worker-f8007b01-22e5-4a6e-b100-a4efbade1d80/worker-cib_tomi
distributed.worker - INFO - -------------------------------------------------

fit log -

-> a.fit(X2, y2)
(Pdb) distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
n
distributed.worker - DEBUG - Execute key: array-original-8d35e675b41aad38dc334c7f79ea1982 worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: array-original-8d35e675b41aad38dc334c7f79ea1982, {'op': 'task-finished', 'status': 'OK', 'nbytes': 80, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0739253, 'stop': 1580374654.0739493, 'thread': 140735091896752, 'key': 'array-original-8d35e675b41aad38dc334c7f79ea1982'}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0785978, 'stop': 1580374654.078607, 'thread': 140735091896752, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 0)"}
distributed.worker - DEBUG - Execute key: ('array-8d35e675b41aad38dc334c7f79ea1982', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('array-8d35e675b41aad38dc334c7f79ea1982', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 40, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0801446, 'stop': 1580374654.080152, 'thread': 140735091896752, 'key': "('array-8d35e675b41aad38dc334c7f79ea1982', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0842004, 'stop': 1580374654.0843685, 'thread': 140735091896752, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Execute key: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0857737, 'stop': 1580374654.0858817, 'thread': 140735091896752, 'key': "('unique_internal-getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 0), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.088994, 'stop': 1580374654.089002, 'thread': 140735091896752, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 0)"}
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Execute key: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1) worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Send compute response to scheduler: ('getitem-a6b7823aa95705e499984f972c2b58b3', 1), {'op': 'task-finished', 'status': 'OK', 'nbytes': 16, 'type': <class 'numpy.ndarray'>, 'start': 1580374654.0944228, 'stop': 1580374654.0944307, 'thread': 140735091896752, 'key': "('getitem-a6b7823aa95705e499984f972c2b58b3', 1)"}
...
...
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - DEBUG - Heartbeat skipped: channel busy
distributed.worker - INFO - Run out-of-band function 'start_tracker'
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
distributed.worker - DEBUG - Deleted 1 keys
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
/mnt/pai/home/pradghos/anaconda3/envs/gdf37/lib/python3.7/site-packages/dask/dataframe/_compat.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm  # noqa: F401
distributed.worker - DEBUG - Execute key: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 worker: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Execute key: train_part-140acf4f99cbae1677f5d995d3ac0e1e worker: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
[02:57:35] WARNING: /opt/anaconda/conda-bld/xgboost-base_1579835034723/work/src/learner.cc:622: Tree method is automatically selected to be 'approx' for distributed training.[02:57:35] WARNING: /opt/anaconda/conda-bld/xgboost-base_1579835034723/work/src/learner.cc:622: Tree method is automatically selected to be 'approx' for distributed training.

[02:57:35] Tree method is automatically selected to be 'approx' for distributed training.[02:57:35] Tree method is automatically selected to be 'approx' for distributed training.

distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:34459
distributed.worker - DEBUG - Heartbeat: tcp://127.0.0.1:33495
distributed.worker - DEBUG - future state: train_part-7f615b0486ae0a04c5aadb6e5c529bb8 - RUNNING
distributed.worker - DEBUG - future state: train_part-140acf4f99cbae1677f5d995d3ac0e1e - RUNNING

The train_part portion of the distributed XGBoost workload runs fine on both workers.

@pradghos
Author

Any pointers on whether the sparse package from conda-forge is incompatible with dask-xgboost or xgboost, or on what else could be behind the Dask worker dying? It would really help!

@pradghos pradghos changed the title [bug] Dask worker dies while during dask-xgboost classifier training : test_core.py::test_classifier [bug] Dask worker dies during dask-xgboost classifier training : test_core.py::test_classifier Jan 30, 2020
@TomAugspurger
Member

TomAugspurger commented Jan 30, 2020 via email

@pradghos
Author

It is easily reproducible with the test case we have in dask-xgboost: pytest -v test_core.py::test_classifier

Code snippet -

import numpy as np
import xgboost as xgb
import dask.array as da
import dask_xgboost as dxgb
from distributed import Client
from distributed.utils_test import cluster, loop  # noqa: F401 (loop is a pytest fixture)
from dask.array.utils import assert_eq

# X and y are module-level numpy arrays defined elsewhere in test_core.py.


def test_classifier(loop):  # noqa
    with cluster() as (s, [a, b]):
        with Client(s["address"], loop=loop):
            a = dxgb.XGBClassifier()
            X2 = da.from_array(X, 5)
            y2 = da.from_array(y, 5)
            a.fit(X2, y2)  # ====> It hangs here.
            p1 = a.predict(X2)

    b = xgb.XGBClassifier()
    b.fit(X, y)
    np.testing.assert_array_almost_equal(
        a.feature_importances_, b.feature_importances_
    )
    assert_eq(p1, b.predict(X))

As I mentioned earlier, whenever the sparse conda package is present, the hang is observed because one Dask worker dies.
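For what it's worth, a quick way to confirm which import path dask_xgboost picked up on the workers is a small diagnostic like the one below (hypothetical helper, not part of the test; the core.sparse / core.ss attribute names are assumptions based on dask_xgboost/core.py):

from distributed import Client

def sparse_import_state():
    # Report whether dask_xgboost.core managed to import the optional sparse backends.
    import dask_xgboost.core as core
    return {
        "sparse": bool(getattr(core, "sparse", False)),
        "scipy.sparse": bool(getattr(core, "ss", False)),
    }

client = Client("tcp://127.0.0.1:45767")  # scheduler address from the failing run's log above
print(client.run(sparse_import_state))    # one result per worker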

Please let me know if you need any other information.

Thanks!

@TomAugspurger
Member

You also have scipy installed?

The sparse code is fairly self-contained, just https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L17-L22 and https://github.com/dask/dask-xgboost/blob/master/dask_xgboost/core.py#L50-L63. Are you able to step through those and see where things go wrong?
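For reference, those two spans are roughly the following (paraphrased, so treat the names and details as approximations and check the linked lines for the exact code):

import numpy as np
import pandas as pd

# Optional sparse support (approximately core.py L17-L22): the imports are
# attempted and fall back to False when the packages are not installed.
try:
    import sparse
    import scipy.sparse as ss
except ImportError:
    sparse = False
    ss = False


# Partition concatenation that dispatches on the partition type
# (approximately core.py L50-L63).
def concat(L):
    if isinstance(L[0], np.ndarray):
        return np.concatenate(L, axis=0)
    elif isinstance(L[0], (pd.DataFrame, pd.Series)):
        return pd.concat(L, axis=0)
    elif ss and isinstance(L[0], ss.spmatrix):
        return ss.vstack(L, format="csr")
    elif sparse and isinstance(L[0], sparse.SparseArray):
        return sparse.concatenate(L, axis=0)
    else:
        raise TypeError("Expected numpy/pandas/scipy/sparse partitions, got %s" % type(L[0]))

Printing type(L[0]) (or setting a breakpoint) inside concat during train_part should show whether the dying worker was going down the sparse branch.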
