[DO NOT MERGE] Debug leaked semaphore #6578

Closed
tohtana wants to merge 24 commits

@tohtana tohtana commented Sep 27, 2024

Our CI tests frequently get stuck after showing the warning below. You can find an example here.

/opt/conda/envs/ptca/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Because this does not reproduce in my local environment, I added logging to see what happens in the CI environment.
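The logging added in this PR is not reproduced on this page. As a minimal sketch of that kind of instrumentation, assuming only the Python standard library on a Unix-like CI runner, a test could print its peak memory at chosen points. The names `peak_rss_mib` and `log_memory` are hypothetical and are not part of DeepSpeed.

```python
import os
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> str:
    """Print and return one log line with the PID and peak RSS (hypothetical helper)."""
    line = f"[mem] {tag} pid={os.getpid()} peak_rss={peak_rss_mib():.1f}MiB"
    # flush=True so the line reaches the CI log even if the job hangs afterwards.
    print(line, flush=True)
    return line
```

Calling `log_memory("before test")` and `log_memory("after test")` around a test body would show whether peak memory keeps growing from one test to the next.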

@tohtana tohtana mentioned this pull request Oct 12, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 14, 2024
Tests with `reuse_dist_env = True` often cause memory leaks. This PR
ignores `reuse_dist_env` and forcibly sets it to `False`. This change
might slow down the tests, but I think that is better than having to
manually restart runners and relaunch tests.

Memory usages (See #6578):
- `reuse_dist_env == True`:
https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512
- `reuse_dist_env == False`:
https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
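The commit message above describes ignoring `reuse_dist_env` and forcing it to `False`. The actual diff is not shown on this page, so the following is only a sketch of the idea using a hypothetical stand-in base class; `DistributedTestSketch`, `effective_reuse_dist_env`, and `TestWithReuse` are assumptions made for illustration, not DeepSpeed code.

```python
class DistributedTestSketch:
    """Hypothetical stand-in for a distributed test base class."""

    # Subclasses may request reuse of the distributed environment.
    reuse_dist_env = False

    def effective_reuse_dist_env(self) -> bool:
        """Return the setting the launcher should use: always False."""
        if self.reuse_dist_env:
            # Report the override so the CI log explains the slower run.
            print(f"Ignoring reuse_dist_env for {type(self).__name__}")
        return False


class TestWithReuse(DistributedTestSketch):
    # The request is still declared here, but it is ignored by the base class.
    reuse_dist_env = True
```

Overriding in one place keeps the existing test classes unchanged, so the flag can be honored again later without touching each test.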

tohtana commented Oct 15, 2024

I think the cause of the memory issue in the tests has been identified as `reuse_dist_env`, which was disabled by #6623. We still see tests get stuck sometimes, so I opened #6627 to track that down.

@tohtana tohtana closed this Oct 15, 2024