Add warnings and docs for affinity set failure #1420

Merged: 2 commits, Dec 20, 2024

@pentschev (Member) commented Dec 19, 2024

When PyNVML fails to identify the CPU affinity correctly, launching Dask-CUDA may fail with an error. After extensive discussion in #1381, it seems appropriate to let startup continue when CPU affinity identification fails and to print a warning with a link to documentation instead. New documentation is also added to help with the first steps of troubleshooting.
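The warn-and-continue behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `try_set_cpu_affinity`, the `set_affinity` parameter, and the `TROUBLESHOOTING_URL` placeholder are all hypothetical, and it assumes affinity is applied with `os.sched_setaffinity`, which is only available on some platforms (such as Linux).

```python
import os
import warnings

# Hypothetical placeholder; the documentation link added by the PR is not reproduced here.
TROUBLESHOOTING_URL = "https://example.invalid/troubleshooting"


def try_set_cpu_affinity(cores, set_affinity=None):
    """Try to pin the current process to `cores`; warn instead of raising on failure.

    `set_affinity` is an illustrative hook (not part of any library API) that
    lets callers substitute the function used to apply the affinity.
    """
    if set_affinity is None:
        # os.sched_setaffinity does not exist on every platform.
        set_affinity = getattr(os, "sched_setaffinity", None)
    try:
        if set_affinity is None:
            raise OSError("CPU affinity is not supported on this platform")
        # pid 0 refers to the calling process.
        set_affinity(0, cores)
        return True
    except Exception as exc:
        # Keep going without affinity rather than aborting startup.
        warnings.warn(
            f"Setting CPU affinity to {cores!r} failed ({exc!r}); "
            f"continuing without it. See {TROUBLESHOOTING_URL} for troubleshooting.",
            RuntimeWarning,
        )
        return False
```

Returning a boolean rather than re-raising lets the caller decide whether to log anything further; the real plugin may well handle this differently.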

Unfortunately, testing warnings in Distributed plugins seems very hard to do. I couldn't find a way to do it even with distributed.utils_test.captured_logger, which runs only after the cluster has been created with a LocalCluster (or LocalCUDACluster). For the dask cuda worker CLI, there's no way for us to mock the value passed to CPUAffinity to force a warning to be raised, so no tests are added at this time.

Closes #1381.

@pentschev requested a review from a team as a code owner December 19, 2024 22:18
@github-actions bot added the "python" (python code needed) label Dec 19, 2024
@pentschev added the "bug" (Something isn't working), "3 - Ready for Review" (Ready for review by team), and "non-breaking" (Non-breaking change) labels Dec 19, 2024
@quasiben (Member) commented

/merge

@rapids-bot bot merged commit fd8a736 into rapidsai:branch-25.02 Dec 20, 2024
27 checks passed
Development

Successfully merging this pull request may close these issues.

DASK Deployment using SLURM with GPUs
2 participants