Unmanaged memory is high and frozen execution #295

Open
pappagari opened this issue Oct 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@pappagari

pappagari commented Oct 11, 2024

Describe the bug

The following warning appears:

2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 16.47 GiB -- Worker memory limit: 23.34 GiB

Even though it is just a warning, execution freezes after it. I am running the tinystories tutorial on 8 CPU workers, and the freeze happens after the clean_and_unify step of the tutorial.
After the freeze, top still shows 8 active worker processes.
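For context, the Dask page linked in the warning covers the case where the allocator has not returned freed memory to the OS; one mitigation it describes is manually trimming unmanaged memory on each worker. A minimal sketch, assuming a dask.distributed Client named client is already connected to the cluster:

import ctypes

def trim_memory() -> int:
    """Ask glibc to return free heap pages to the OS (Linux only)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Run on every worker; a nonzero return value means memory was released.
client.run(trim_memory)

The same page also mentions setting the MALLOC_TRIM_THRESHOLD_ environment variable before the workers start, so glibc trims more aggressively on its own.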

Steps/Code to reproduce bug

I am trying the tinystories tutorial on the c4 realnewslike dataset.

Download the dataset as follows (obtained from https://huggingface.co/datasets/allenai/c4)

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "realnewslike/*"

The dataset is 37 GB and contains 513 files, each with 26,953 entries. I don't have issues running this tutorial on the smaller version of the dataset (2 GB), so I think the warning is likely related to handling large datasets.

Expected behavior

Expected it to finish execution and write out the processed data.

Environment overview (please complete the following information)

OS version -- Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
Python version -- 3.10.15
pip version -- 24.2
dask version -- 2024.7.1
dask_cuda version -- 24.08.02

@pappagari pappagari added the bug Something isn't working label Oct 11, 2024
@ayushdg
Collaborator

ayushdg commented Oct 14, 2024

Thanks for raising the issue.
cc: @Maghoumi , @ryantwolf

@Maghoumi
Collaborator

This might be related to another issue we recently investigated where the memory usage went extremely high with 8 workers, but not with 4 workers. Ryan suspected some change on the RAPIDS side may have contributed to it.

@ayushdg
Collaborator

ayushdg commented Oct 14, 2024

This might be related to another issue we recently investigated where the memory usage went extremely high with 8 workers, but not with 4 workers. Ryan suspected some change on the RAPIDS side may have contributed to it.

Thanks. Given that the OOMs/hangs being discussed here involve CPU modules, it seems unlikely that a RAPIDS change would have impacted the results. In either case, @pappagari, if you could try the same run with fewer workers, it would be interesting to see whether that works for your use case.
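A minimal sketch of limiting the worker count with a plain dask.distributed LocalCluster (the exact client setup in the tutorial may differ; n_workers=4 here is just an illustration):

from dask.distributed import Client, LocalCluster

# Start a local CPU cluster with fewer workers (4 instead of 8) so each
# worker gets a larger share of host memory; adjust n_workers as needed.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client.dashboard_link)  # dashboard for watching per-worker memory

With fewer workers, the per-worker memory limit rises accordingly, which can help distinguish a genuine leak from simple per-worker memory pressure.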
