-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Saga pantry attach/detach: fail if pantry is removed from DNS #6866
base: main
Are you sure you want to change the base?
Conversation
Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.
This builds on #6866. After its changes, there was only caller of `retry_until_known_result` left; this PR removes it. We keep the retry loop, but instead of retrying for ever, we bail out if the sled we're trying to reach is "gone", as determined by "is it no longer in-service", which in practice means it's been expunged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to come back to your testing question, but the code looks correct to me
// Importantly, _do not use `call_pantry_attach_for_disk`_! That retries | ||
// as long as `pantry_address` is still resolvable in DNS, which we _do not | ||
// want here_. The Pantry attach can fail if there's a racing Volume | ||
// checkout to be sent to Propolis. Additionally, that call uses `attach` | ||
// instead of `attach_activate_background`, which means it will hang on the | ||
// activation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would update this message to only
Importantly, _do not use `call_pantry_attach_for_disk`_! That call uses `attach` instead of `attach_activate_background`, which means it will hang on the activation.
At some point, some sort of retry should be added here, but that shouldn't block this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not super familiar with this code so I'm not sure how much confidence my +1 adds but it looks good to me!
Ok(entries) => entries, | ||
Err(err) => { | ||
warn!( | ||
log, "Failed to resolve Crucible pantry in DNS"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log, "Failed to resolve Crucible pantry in DNS"; | |
log, "Failed to query DNS for Crucible pantry"; |
super nitty but sometimes "failed to resolve X in DNS" means "I successfully reached DNS and X wasn't there", which isn't the case here
Removal from DNS should correspond with the pantry zone's expungement, which should result in a saga node failure.
I'm marking this as a draft because it's completely untested. @jmpesp any suggestions for testing? I skipped over the Nexus integration tests for a test to expand or emulate and didn't see anything in this area, but definitely could've missed it (they are legion!). I could construct a local unit test of the attach/detach functions, but it would require a fair bit of setup machinery (a fake crucible pantry, modifying the DNS records, ...). Maybe worth it if there's no easy way to test this?