Reinterpolate to all elements for new target points (fix BBH deadlocks) #5531

Merged
merged 1 commit into from
Oct 12, 2023

Conversation

knelli2
@knelli2 knelli2 commented Oct 3, 2023

Proposed changes

This is believed to be the cause of many of the deadlocks in BBH simulations. Between horizon-find iterations, a target point on the new horizon surface could move from an element on one core to a different element on a second core. Because of an implicit assumption about message ordering (never made consciously, just overlooked), that target point was never interpolated to. The target then waited for interpolated data at this point while the interpolator waited for its next set of points, and the run deadlocked.
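
To make the failure mode concrete, here is a toy model of it; it is not the SpECTRE code. It assumes, based on the code comment under review, that a core which interpolated at least one point clears its entry for that temporal_id, while a core that held none of the points keeps it. `ToyCore`, `receive`, and `interpolated_in_second_iteration` are hypothetical names, and elements are reduced to 1D intervals.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Toy model (assumption): each "core" owns the 1D interval [lo, hi) and
// remembers the target points it was sent, keyed by temporal_id.
struct ToyCore {
  double lo;
  double hi;
  std::map<double, std::vector<double>> points_at;  // temporal_id -> points

  // overwrite == false: a set that arrives while an entry already exists is
  // ignored (the overlooked case). overwrite == true: the new set replaces it.
  // Returns how many of the newly received points this core interpolated.
  std::size_t receive(double temporal_id, const std::vector<double>& points,
                      bool overwrite) {
    const auto it = points_at.find(temporal_id);
    if (it != points_at.end() && !overwrite) {
      return 0;  // new points are never looked at
    }
    points_at[temporal_id] = points;
    std::size_t n = 0;
    for (const double x : points) {
      if (x >= lo && x < hi) {
        ++n;
      }
    }
    if (n > 0) {
      // Assumption: a core that interpolated something cleans up its entry.
      points_at.erase(temporal_id);
    }
    return n;
  }
};

// Sends two successive horizon-find iterations to two cores and returns how
// many points of the second iteration were interpolated in total.
inline std::size_t interpolated_in_second_iteration(bool overwrite) {
  ToyCore core0{0.0, 0.5, {}};
  ToyCore core1{0.5, 1.0, {}};
  const double temporal_id = 1.0;
  const std::vector<double> first{0.2, 0.4};   // both points live on core0
  const std::vector<double> second{0.2, 0.6};  // one point moved to core1
  core0.receive(temporal_id, first, overwrite);
  core1.receive(temporal_id, first, overwrite);
  return core0.receive(temporal_id, second, overwrite) +
         core1.receive(temporal_id, second, overwrite);
}
```

In this toy, `interpolated_in_second_iteration(false)` interpolates only one of the two points in the second set, so the target would wait indefinitely for the point at 0.6; with `true` both points are interpolated.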

Fixes #5487.

Upgrade instructions

Code review checklist

  • The code is documented and the documentation renders correctly. Run
    make doc to generate the documentation locally into BUILD_DIR/docs/html.
    Then open index.html.
  • The code follows the stylistic and code quality guidelines listed in the
    code review guide.
  • The PR lists upgrade instructions and is labeled bugfix or
    new feature if appropriate.

Further comments

@knelli2 knelli2 added priority critical for progress bugfix labels Oct 3, 2023
@knelli2 knelli2 requested a review from markscheel October 3, 2023 17:39
// elements that contained any of the previous target points. However,
// these elements may now contain some new target points.
if (vars_infos.count(temporal_id) == 0) {
vars_infos.emplace(std::make_pair(
Member

vars_infos.insert_or_assign? (Haven't checked exactly what type this is, but most similar classes have it.)

Contributor Author

Hmm, let me think about whether we need to keep the other members of the Info class when we receive a new set of points. I'm thinking no.
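
For reference, a minimal sketch of the difference between the two calls, assuming `vars_infos` behaves like a `std::unordered_map` (the thread does not confirm its type). `Info` here is a hypothetical one-member stand-in for the real class, and the two helper functions are made up for illustration.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in; the real Info class has more members than this.
struct Info {
  std::vector<double> target_points;
};

// emplace leaves an existing entry untouched; `.second` is true only if a new
// entry was inserted.
inline bool add_with_emplace(std::unordered_map<double, Info>& vars_infos,
                             double temporal_id, Info info) {
  return vars_infos.emplace(temporal_id, std::move(info)).second;
}

// insert_or_assign (C++17) replaces the whole mapped value if the key already
// exists; `.second` is false when an assignment took place.
inline bool add_with_insert_or_assign(
    std::unordered_map<double, Info>& vars_infos, double temporal_id,
    Info info) {
  return vars_infos.insert_or_assign(temporal_id, std::move(info)).second;
}
```

Note that `insert_or_assign` replaces the entire mapped value, so any other members of the existing entry are discarded along with the old points.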

Comment on lines 83 to 89
// Add the new target interpolation points at this temporal_id. If we
// already have some target points at this temporal_id, we overwrite
// them with the new target points. This is because the interpolation
// has already finished for the previous set of target points if we
// are receiving new target points. This core must not have had any
// elements that contained any of the previous target points. However,
// these elements may now contain some new target points.
Member

Could this happen in the opposite order? We receive the second set of points (which need interpolation) first, and then overwrite them with the first set (which didn't)?

Contributor Author

Yes, I believe that could technically happen in the opposite order. I wonder if we should send along the horizon-find iteration so we know which set of points is the correct one (always the higher iteration).

@knelli2 knelli2 changed the title Reinterpolate to all elements when interpolator receives new points (fix BBH deadlocks) Reinterpolate to all elements for new target points (fix BBH deadlocks) Oct 3, 2023
@knelli2 knelli2 force-pushed the fix_intrp_deadlock branch from ec07609 to 42fca67 Compare October 4, 2023 18:39
@knelli2
Contributor Author

knelli2 commented Oct 4, 2023

@wthrowe Pushed a fixup that covers the scenario of receiving points in the wrong order.

@wthrowe
Member

wthrowe commented Oct 4, 2023

That should do it. Go ahead and squash.

@knelli2 knelli2 force-pushed the fix_intrp_deadlock branch from 42fca67 to a8af8b0 Compare October 4, 2023 20:29
Comment on lines +112 to +117
if (vars_infos.count(temporal_id) == 0 or
vars_infos.at(temporal_id).iteration < iteration) {
Contributor Author

Make it an error if the new iteration is the same as the old iteration. That case would indicate a bug.

Contributor Author

@wthrowe Pushed a fixup for this

Member

Looks good.
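
The iteration check discussed in this thread could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `IterInfo` and `store_points` are hypothetical names, `vars_infos` is assumed to be a `std::unordered_map`, and a thrown `std::logic_error` stands in for whatever error mechanism the project actually uses.

```cpp
#include <cstddef>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real Info class, reduced to the two fields
// the guard needs.
struct IterInfo {
  std::size_t iteration;
  std::vector<double> target_points;
};

// Returns true if the points were stored and false if they were stale (an
// older iteration arriving after a newer one). Throws if the same iteration
// arrives twice for one temporal_id, which is treated as a bug.
inline bool store_points(std::unordered_map<double, IterInfo>& vars_infos,
                         double temporal_id, std::size_t iteration,
                         std::vector<double> points) {
  const auto it = vars_infos.find(temporal_id);
  if (it == vars_infos.end()) {
    vars_infos.emplace(temporal_id, IterInfo{iteration, std::move(points)});
    return true;
  }
  if (it->second.iteration == iteration) {
    throw std::logic_error("Received the same iteration twice");
  }
  if (it->second.iteration < iteration) {
    it->second = IterInfo{iteration, std::move(points)};
    return true;
  }
  return false;  // older set arrived late; keep the newer one
}
```

Because the comparison uses the iteration number rather than arrival order, the result is the same whichever set of points is received first.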

@knelli2 knelli2 force-pushed the fix_intrp_deadlock branch 3 times, most recently from 68928c3 to e0db394 Compare October 5, 2023 23:12
@knelli2
Contributor Author

knelli2 commented Oct 5, 2023

@wthrowe Squashed

@AlexCarpenter46
Contributor

I've been testing this branch on a few different head-on BBHs (different resolutions, enabling checkpoints, and quiet/debug output) and they've all successfully reached a common horizon :). So I would say this has fixed the deadlocks we've been encountering. These runs are on Ocean at /home/alexcarpenter/BBH_Headon/BBH_develop_with_fix

@nilsvu
Member

nilsvu commented Oct 7, 2023

Hoorayyy 🙌

This is believed to be what was causing a lot of deadlocks in BBH
simulations.
@knelli2 knelli2 force-pushed the fix_intrp_deadlock branch from e0db394 to 62a7326 Compare October 11, 2023 17:08
@nilsdeppe
Member

@wthrowe can you please take a look at this again?

@wthrowe
Member

wthrowe commented Oct 12, 2023

It's waiting on @markscheel?

@nilsdeppe
Member

@wthrowe Squashed

?

@knelli2 knelli2 removed the request for review from markscheel October 12, 2023 03:21
@knelli2
Contributor Author

knelli2 commented Oct 12, 2023

I removed Mark because this is a pretty important bug fix that I think we should get in quickly.

@wthrowe
Member

wthrowe commented Oct 12, 2023

If the PR author requests a review, I'm going to respect that.

@wthrowe wthrowe merged commit e982765 into sxs-collaboration:develop Oct 12, 2023
21 of 22 checks passed
@nilsdeppe
Member

If the PR author requests a review, I'm going to respect that.

That doesn't mean you can't give feedback post-squash, though.

Labels
bugfix priority critical for progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix BBH evolution deadlocks
5 participants