Make 2-proc cuda distributed remap soft-fail #2129

Merged
merged 1 commit into from
Jan 15, 2025

Conversation

charleskawczynski
Member

This PR makes the distributed remapping with CUDA (2 processes) job soft-fail for now, since it's failing pretty consistently.

@Sbozzolo
Member

I worry that this will just hide the problem and the issue will not be addressed.

In the past, this job showed no signs of flakiness, so we must have introduced something new relatively recently.

Maybe before making this a soft fail, we can try spending two hours to see if we can reproduce the problem reliably. If, after two hours of work, we cannot come up with a reproducer, we can turn this into a soft fail.

@charleskawczynski
Member Author

> I worry that this will just hide the problem and the issue will not be addressed.
>
> In the past, this job showed no signs of flakiness, so we must have introduced something new relatively recently.

I understand, but this has been an issue as far back as December 20, 2024 (#2108). This is why I opened the issue, to make sure that we don't lose track.

> Maybe before making this a soft fail, we can try spending two hours to see if we can reproduce the problem reliably. If, after two hours of work, we cannot come up with a reproducer, we can turn this into a soft fail.

In this build: https://buildkite.com/clima/climacore-ci/builds/4912, I retried the same test multiple times, and it's clearly a race condition, since the number of tests that passed was different on different attempts. So it's not clear to me that we can make a reproducer other than by running the tests multiple times. I suppose we could put the test inside a loop and call that a reproducer.

I'm happy to look into it; my thought was that we could merge this in the meantime, since the failing job clouds the overall CI status.
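For illustration only, the "put it inside a loop" idea could be sketched roughly as below. This is a minimal sketch, not the project's actual test harness: the helper names `run_repeatedly` and `summarize` are hypothetical, and the command shown is a stand-in that always succeeds, since the real 2-process CUDA launch command is not given in this thread. Note also that it only tallies process exit codes, whereas the flakiness described above showed up as a varying number of passing tests.

```python
import subprocess
import sys
from collections import Counter


def run_repeatedly(cmd, attempts):
    """Run `cmd` `attempts` times and return the list of exit codes."""
    exit_codes = []
    for _ in range(attempts):
        # Each attempt is a fresh process, so no state carries over between runs.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        exit_codes.append(result.returncode)
    return exit_codes


def summarize(exit_codes):
    """Count how many attempts ended with each exit code."""
    return Counter(exit_codes)


if __name__ == "__main__":
    # Hypothetical stand-in command; replace it with the real test launch.
    cmd = [sys.executable, "-c", "import sys; sys.exit(0)"]
    counts = summarize(run_repeatedly(cmd, attempts=5))
    print(dict(counts))
```

If the tally contains more than one distinct exit code for an unchanged command, that would be consistent with nondeterministic behavior such as a race condition, although it would not by itself identify the cause.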

@charleskawczynski charleskawczynski merged commit 988bda5 into main Jan 15, 2025
35 checks passed
@charleskawczynski charleskawczynski deleted the ck/dist_remap_soft_fail branch January 15, 2025 13:57