Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Saga pantry attach/detach: fail if pantry is removed from DNS #6866

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jgallagher
Copy link
Contributor

Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.

I'm marking this as a draft because it's completely untested. @jmpesp any suggestions for testing? I skipped over the Nexus integration tests for a test to expand or emulate and didn't see anything in this area, but definitely could've missed it (they are legion!). I could construct a local unit test of the attach/detach functions, but it would require a fair bit of setup machinery (a fake crucible pantry, modifying the DNS records, ...). Maybe worth it if there's no easy way to test this?

Removal from DNS should correspond with the pantry zone's expungement,
which should result in a saga node failure.
jgallagher added a commit that referenced this pull request Oct 14, 2024
This builds on #6866. After its changes, there was only caller of
`retry_until_known_result` left; this PR removes it. We keep the retry
loop, but instead of retrying for ever, we bail out if the sled we're
trying to reach is "gone", as determined by "is it no longer
in-service", which in practice means it's been expunged.
Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to come back to your testing question, but the code looks correct to me

Comment on lines +1287 to +1292
// Importantly, _do not use `call_pantry_attach_for_disk`_! That retries
// as long as `pantry_address` is still resolvable in DNS, which we _do not
// want here_. The Pantry attach can fail if there's a racing Volume
// checkout to be sent to Propolis. Additionally, that call uses `attach`
// instead of `attach_activate_background`, which means it will hang on the
// activation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would update this message to only

Importantly, _do not use `call_pantry_attach_for_disk`_! That call uses `attach` instead of `attach_activate_background`, which means it will hang on the activation.

At some point, some sort of retry should be added here, but that shouldn't block this PR.

@davepacheco davepacheco added this to the 12 milestone Oct 15, 2024
Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super familiar with this code so I'm not sure how much confidence my +1 adds but it looks good to me!

Ok(entries) => entries,
Err(err) => {
warn!(
log, "Failed to resolve Crucible pantry in DNS";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log, "Failed to resolve Crucible pantry in DNS";
log, "Failed to query DNS for Crucible pantry";

super nitty but sometimes "failed to resolve X in DNS" means "I successfully reached DNS and X wasn't there", which isn't the case here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants