Unmanaged memory is high and frozen execution #295

Open
pappagari opened this issue Oct 11, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@pappagari

pappagari commented Oct 11, 2024

Describe the bug

The following warning appears:

2024-10-11 00:04:31,529 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 16.47 GiB -- Worker memory limit: 23.34 GiB

Even though it is just a warning, execution freezes after it. I am running the tinystories tutorial on 8 CPU workers, and the freeze happens after the clean_and_unify step of the tutorial.
After the freeze, top still shows 8 active worker processes.
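For context, the Dask page linked in the warning covers the case where the allocator has not returned freed memory to the OS; one mitigation it describes is manually trimming unmanaged memory on each worker. A minimal sketch, assuming a dask.distributed Client named client is already connected to the cluster:

import ctypes

def trim_memory() -> int:
    """Ask glibc to return free heap pages to the OS (Linux only)."""
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# Run on every worker; a nonzero return value means memory was released.
client.run(trim_memory)

The same page also mentions setting the MALLOC_TRIM_THRESHOLD_ environment variable before the workers start, so glibc trims more aggressively on its own.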

Steps/Code to reproduce bug

I am trying the tinystories tutorial on the c4 realnewslike dataset.

Download the dataset as follows (obtained from https://huggingface.co/datasets/allenai/c4)

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "realnewslike/*"

The dataset is 37 GB and contains 513 files, each with 26,953 entries. I don't have issues running this tutorial on the smaller version of the dataset (2 GB), so I think the warning is likely related to handling large datasets.

Expected behavior

Expected it to finish execution and write out the processed data.

Environment overview (please complete the following information)

OS version -- Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
Python version -- 3.10.15
pip version -- 24.2
dask version -- 2024.7.1
dask_cuda version -- 24.08.02

@pappagari pappagari added the bug Something isn't working label Oct 11, 2024
@ayushdg
Collaborator

ayushdg commented Oct 14, 2024

Thanks for raising the issue.
cc: @Maghoumi , @ryantwolf

@Maghoumi
Collaborator

This might be related to another issue we recently investigated where the memory usage went extremely high with 8 workers, but not with 4 workers. Ryan suspected some change on the RAPIDS side may have contributed to it.

@ayushdg
Collaborator

ayushdg commented Oct 14, 2024

This might be related to another issue we recently investigated where the memory usage went extremely high with 8 workers, but not with 4 workers. Ryan suspected some change on the RAPIDS side may have contributed to it.

Thanks. Given that the OOMs/hangs being discussed here involve CPU modules, it seems unlikely that a RAPIDS change would have impacted the results. In either case, @pappagari, if you could try the same run with fewer workers, it would be interesting to see whether that works for your use case.
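A minimal sketch of limiting the worker count with a plain dask.distributed LocalCluster (the exact client setup in the tutorial may differ; n_workers=4 here is just an illustration):

from dask.distributed import Client, LocalCluster

# Start a local CPU cluster with fewer workers (4 instead of 8) so each
# worker gets a larger share of host memory; adjust n_workers as needed.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client.dashboard_link)  # dashboard for watching per-worker memory

With fewer workers, the per-worker memory limit rises accordingly, which can help distinguish a genuine leak from simple per-worker memory pressure.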
