[#3886 3/4] Start saga for region replacement #5839
Conversation
Start filling out the region replacement background task skeleton by adding a check for regions that are stored on disks that were expunged, and inserting region replacement requests for them. In the same background task, check if there are any requests in the "Requested" state and run the new "region replacement start" saga for them. This background task will also pick up manually requested region replacements.

Also in this commit is the region replacement start saga, which will take a replacement request that is in state "Requested", and perform the steps to:

- allocate a new region
- swap the region being replaced with that newly allocated region in the affected volume
- create a fake volume, referencing the region that was swapped out, that can later be deleted
- update the region replacement request's state to "Running" and clear the operating saga id

This represents the first step to be taken after a region goes away: allocate the replacement, and swap it into the affected volume. Once this is done, any new checkout and construction of the affected volume will no longer reference the expunged region, but will cause reconciliation to take place. It is still degraded in the sense that the newly allocated region is blank and the other two are not, and an Upstairs needs to perform that reconciliation or repair. Existing constructed Volumes running in a propolis or pantry context will remain unmodified: the next commit will be a saga that takes care of initiating live repair or reconciliation for those existing running Volumes.
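The four saga steps above can be sketched, very loosely, as an in-memory simulation. All type and function names here are hypothetical stand-ins for illustration, not the actual Nexus types:

```rust
// Hypothetical, heavily simplified sketch of the start saga's forward
// actions on in-memory stand-ins; the real saga operates on database records.

#[derive(Debug, Clone, PartialEq)]
struct Volume {
    targets: Vec<String>, // downstairs addresses
}

#[derive(Debug)]
struct Request {
    state: &'static str,
    old_region_volume_id: Option<u64>,
}

// Swap `old` for `new` in the affected volume, returning a fake volume
// that references only the swapped-out region (for later deletion).
fn replace_region(vol: &mut Volume, old: &str, new: &str) -> Volume {
    for t in vol.targets.iter_mut() {
        if t == old {
            *t = new.to_string();
        }
    }
    Volume { targets: vec![old.to_string()] }
}

fn main() {
    let mut vol =
        Volume { targets: vec!["a".into(), "b".into(), "c".into()] };
    let mut req = Request { state: "Requested", old_region_volume_id: None };

    // 1. allocate a new region (stand-in: the address "d")
    // 2. swap it into the affected volume
    let fake = replace_region(&mut vol, "b", "d");
    // 3. record the fake volume for later deletion
    req.old_region_volume_id = Some(1);
    // 4. move the request to "Running"
    req.state = "Running";

    assert_eq!(vol.targets, vec!["a", "d", "c"]);
    assert_eq!(fake.targets, vec!["b"]);
    assert_eq!(req.state, "Running");
}
```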
Mostly questions I have to help me understand
```rust
// Skip looking at read-only parent, this function only looks
// for R/W regions
```
for now...
```rust
let old_volume = if let Some(old_volume) = maybe_old_volume {
    old_volume
```
Any reason you don't do the

```rust
let old_volume = match volume_dsl::volume.filter(...) {
    Some(ov) => ov,
    _ => return Err(...),
};
```

pattern here like you do below?
Not really, though it does reduce the indentation to bail early here.
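For illustration, the two equivalent shapes under discussion, in isolation (toy functions invented for this sketch, not the actual code):

```rust
// Illustrative only: the two equivalent shapes discussed above.
#[derive(Debug, PartialEq)]
enum Error {
    TargetVolumeDeleted,
}

// if-let with an else arm that bails:
fn via_if_let(maybe_old_volume: Option<String>) -> Result<String, Error> {
    let old_volume = if let Some(old_volume) = maybe_old_volume {
        old_volume
    } else {
        return Err(Error::TargetVolumeDeleted);
    };
    Ok(old_volume)
}

// match with an early return, keeping the happy path less indented:
fn via_match(maybe_old_volume: Option<String>) -> Result<String, Error> {
    let old_volume = match maybe_old_volume {
        Some(ov) => ov,
        None => return Err(Error::TargetVolumeDeleted),
    };
    Ok(old_volume)
}

fn main() {
    assert_eq!(via_if_let(Some("v".to_string())), Ok("v".to_string()));
    assert_eq!(via_match(None), Err(Error::TargetVolumeDeleted));
}
```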
```rust
} else {
    // existing volume was deleted, so return an error, we
    // can't perform the region replacement now!
    return Err(err.bail(VolumeReplaceRegionError::TargetVolumeDeleted));
```
Is the only possible outcome other than `Some(old_volume)` from the `volume_dsl::volume...` a deleted volume? I guess the `err.bail_retryable_or_else()` will return an error directly if the query fails?
As I understand it, `bail` here will stop the transaction retry from looping. The volume being deleted is fatal for the region replacement request's progress, so we bail here.
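A loose sketch of that retryable-versus-bail distinction. The names `TxError` and `run_with_retry` are invented for illustration; the real code uses Nexus's transaction retry helper, not this loop:

```rust
// Hypothetical sketch of the retry-vs-bail distinction.
#[derive(Debug, PartialEq)]
enum TxError {
    Retryable,
    Fatal(&'static str), // what `bail` produces: stop retrying immediately
}

fn run_with_retry<F>(mut attempts: u32, mut op: F) -> Result<u32, &'static str>
where
    F: FnMut() -> Result<u32, TxError>,
{
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(TxError::Retryable) if attempts > 0 => attempts -= 1,
            Err(TxError::Retryable) => return Err("retries exhausted"),
            // A fatal ("bailed") error short-circuits the loop.
            Err(TxError::Fatal(msg)) => return Err(msg),
        }
    }
}

fn main() {
    // A fatal error (e.g. target volume deleted) stops the loop at once.
    let r = run_with_retry(3, || Err(TxError::Fatal("volume deleted")));
    assert_eq!(r, Err("volume deleted"));

    // A retryable error is retried until the operation succeeds.
    let mut calls = 0;
    let r = run_with_retry(3, || {
        calls += 1;
        if calls < 3 { Err(TxError::Retryable) } else { Ok(42) }
    });
    assert_eq!(r, Ok(42));
}
```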
```rust
        return Err(err.bail(VolumeReplaceRegionError::SerdeError(e)));
    },
};
```
Is this section just making sure that the replacement did not already happen in another
saga and we don't try to redo work that was already done by someone else?
This section makes the whole function idempotent, yeah, but not for another saga - only one saga can win the lock that happens at the beginning of the start saga, so it's only guarding against when a node is rerun.
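A toy sketch of that node-rerun idempotency (hypothetical types; the real check is a conditional database update):

```rust
// Hypothetical sketch: a state transition that is safe to re-run when a
// saga node is replayed. If the record is already in the target state with
// the same values, treat it as success instead of failing.
#[derive(Debug, Clone, PartialEq)]
struct RequestRecord {
    state: &'static str,
    new_region_id: Option<u64>,
}

fn transition_to_running(
    rec: &mut RequestRecord,
    new_region_id: u64,
) -> Result<(), &'static str> {
    if rec.state == "Allocating" {
        // Normal forward path.
        rec.state = "Running";
        rec.new_region_id = Some(new_region_id);
        Ok(())
    } else if rec.state == "Running"
        && rec.new_region_id == Some(new_region_id)
    {
        // Already done by a previous run of this node: idempotent success.
        Ok(())
    } else {
        Err("unexpected state")
    }
}

fn main() {
    let mut rec = RequestRecord { state: "Allocating", new_region_id: None };
    assert!(transition_to_running(&mut rec, 7).is_ok());
    // Re-running the same node is a no-op, not an error.
    assert!(transition_to_running(&mut rec, 7).is_ok());
    // A conflicting re-run is rejected.
    assert!(transition_to_running(&mut rec, 8).is_err());
}
```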
```rust
Ok(v) => v,

Err(e) => {
    error!(&log, "error looking for existing region replacement requests for {}: {e}", region.id());
```
Does `cargo fmt` not try to break this into multiple lines?
It doesn't!
`cargo fmt` is so lazy when it comes to macros. I'd still manually shrink these lines though. What happens is that one long line prevents `cargo fmt` from working properly across the rest of the file. I've had to fix this before and it's frustrating.
yeah, done in 5f5d732
```rust
// Request an additional region for this volume: THRESHOLD + 1 is required
// in order to have the proper redundancy. It's important _not_ to delete
// the existing region first, as (if it's still there) then the Crucible
// agent could reuse the allocated port and cause trouble.
```
This check also makes sure that the new region does not land on any of the sleds that have any of the other existing regions as well, right?
It does - this uses the existing volume ID and `REGION_REDUNDANCY_THRESHOLD + 1` in the allocation function parameters.
```rust
}

async fn srrs_alloc_new_region_undo(
    sagactx: NexusActionContext,
```
Is there any chance for a very long running region create to cause a problem here?
Like, could we request a region be created and the agent is off trying to do that but
it has not actually created the region, but we unwind this saga and go to delete the
region, but it's not actually created yet?
Like, could that cause an orphaned region to be left behind? I don't think it's possible as
I believe we wait for the region creation to finish before moving forward, here. I also think
that we can request the agent to delete a region and it will delete it (eventually) even if
the creation was in progress when the deletion request for that region came in. Is that
all correct?
You're correct - the region creation code will wait for a known result (created or otherwise) before proceeding, and so will the region deletion, and therefore the saga will only proceed when a result of a request is known.
```rust
// However, we haven't needed it until now! If the Crucible agent is gone
// (which it will be if the disk is expunged), assume that the region in the
// read/write portion of the volume with the same dataset address (there
// should only be one due to the allocation strategy!) is the old region.
```
Ohh.. I wonder if this might some day trip us up. Like, if we are forced to share an agent for
two downstairs as there is no other place to put it.
Is there any way we could put a check somewhere to avoid shooting ourselves in the foot
with this later? Like, something that verifies that all three regions have different targets?
Or, some other way to start populating the table.
I worry that in some distant future we relax the unique agent requirement for regions, and we
forget about the assumption made here and replace the wrong downstairs :(
Ah, reading down below, it looks like we will abort if we can't tell which region to replace.
Maybe that is the best we can do here.
This is a very good point. It should be follow up work to record the port somewhere! Opened #5846
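A guard along the lines the reviewer suggests could be as simple as the following (hypothetical helper, not part of the PR):

```rust
// Hypothetical helper: verify that a volume's regions all have distinct
// dataset addresses before using an address to identify "the" old region.
use std::collections::HashSet;
use std::net::SocketAddrV6;

fn all_targets_distinct(targets: &[SocketAddrV6]) -> bool {
    let unique: HashSet<&SocketAddrV6> = targets.iter().collect();
    unique.len() == targets.len()
}

fn main() {
    let a: SocketAddrV6 = "[fd00::1]:19000".parse().unwrap();
    let b: SocketAddrV6 = "[fd00::2]:19000".parse().unwrap();
    let c: SocketAddrV6 = "[fd00::3]:19000".parse().unwrap();
    assert!(all_targets_distinct(&[a, b, c]));
    // Two regions behind the same agent address would defeat the assumption.
    assert!(!all_targets_distinct(&[a, b, b]));
}
```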
```rust
gen: 0,
opts: CrucibleOpts {
    id: new_volume_id,
    target: vec![old_region_address.to_string()],
```
This old VCR has just a single target?
Yes - it is a Volume that will only be used when later passed to the volume delete saga.
```rust
    old_region_volume_id,
)
.await
.map_err(ActionError::action_failed)?;
```
If we can't talk to the database (or whatever could cause an error here), will things attempt to
recover at some point?
From the saga node's perspective, it could either spin until a result is known (retrying within the DataStore function), or fail immediately, unwinding the saga. Though, if Nexus can't talk to CRDB, then unwinding will also fail, and there'll need to be follow up work to address this.
I'm not super familiar with all the details of region allocation. But based on your discussion with Alan, this seems like it's good to merge. I'd let Alan be the final arbiter though.
```rust
let old_volume = if let Some(old_volume) = maybe_old_volume {
    old_volume
} else {
    // existing volume was deleted, so return an error, we
```
```diff
- // existing volume was deleted, so return an error, we
+ // Existing volume was deleted, so return an error. We
```
done in 85ca4f9
```rust
fn region_in_vcr(
    vcr: &VolumeConstructionRequest,
    region: &SocketAddrV6,
) -> anyhow::Result<bool> {
```
It looks like this function can just return a `bool` instead of wrapping it in a `Result`. There are no errors possible here. That would simplify the callers.
There's unfortunately a `let parsed_target: SocketAddrV6 = target.parse()?;` that prevents this :(
Ah damn, I totally missed that. All good.
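A stripped-down illustration of why that `?` forces a `Result` return (a hypothetical reduction, not the actual function):

```rust
// Illustrative reduction: the membership check must return a Result because
// each stored target string has to be parsed before it can be compared.
use std::net::{AddrParseError, SocketAddrV6};

fn region_in_targets(
    targets: &[&str],
    region: &SocketAddrV6,
) -> Result<bool, AddrParseError> {
    for target in targets {
        // This `?` is what prevents a plain `bool` return type.
        let parsed_target: SocketAddrV6 = target.parse()?;
        if parsed_target == *region {
            return Ok(true);
        }
    }
    Ok(false)
}

fn main() {
    let region: SocketAddrV6 = "[fd00::1]:19000".parse().unwrap();
    assert!(region_in_targets(&["[fd00::1]:19000"], &region).unwrap());
    assert!(!region_in_targets(&["[fd00::2]:19000"], &region).unwrap());
    // A malformed target surfaces as an error rather than a bool.
    assert!(region_in_targets(&["not-an-address"], &region).is_err());
}
```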
```rust
        })
    }
    .boxed()
}

#[cfg(test)]
mod test {
```
Nice test!
Thanks! :)
```rust
request_time: Utc::now(),
old_region_id: region_to_replace.id(),
volume_id: region_to_replace.volume_id(),
old_region_volume_id: None,
```
This is a nitty nit. Feel free to ignore.

I find it confusing that the `old_region_volume_id` is `None`, but `volume_id` is what's getting replaced. I realize why based on prior discussion, since a fake volume id gets generated and becomes `old_region_volume_id`. Maybe change the name of `old_region_volume_id` to `faked_old_region_volume_id` or something to make this more clear.
Without changing the schema, this is a little awkward. I think this would work:
```diff
diff --git a/nexus/db-model/src/region_replacement.rs b/nexus/db-model/src/region_replacement.rs
index a04710f53..3984581f5 100644
--- a/nexus/db-model/src/region_replacement.rs
+++ b/nexus/db-model/src/region_replacement.rs
@@ -135,7 +135,8 @@ pub struct RegionReplacement {
     pub volume_id: Uuid,

     /// A synthetic volume that only is used to later delete the old region
-    pub old_region_volume_id: Option<Uuid>,
+    #[diesel(column_name = old_region_volume_id)]
+    pub delete_old_region_volume_id: Option<Uuid>,

     /// The new region that will be used to replace the old one
     pub new_region_id: Option<Uuid>,
```
but without changing `schema.rs` as well, this would still lead to code referencing both:
```rust
use db::schema::region_replacement::dsl;

let updated = diesel::update(dsl::region_replacement)
    .filter(dsl::id.eq(region_replacement_id))
    .filter(dsl::operating_saga_id.eq(operating_saga_id))
    .filter(
        dsl::replacement_state.eq(RegionReplacementState::Allocating),
    )
    .set((
        dsl::replacement_state.eq(RegionReplacementState::Running),
        dsl::old_region_volume_id.eq(Some(old_region_volume_id)),
        dsl::new_region_id.eq(Some(new_region_id)),
        dsl::operating_saga_id.eq(Option::<Uuid>::None),
    ))
    .check_if_exists::<RegionReplacement>(region_replacement_id)
    .execute_and_check(&*self.pool_connection_authorized(opctx).await?)
    .await;

match updated {
    Ok(result) => match result.status {
        UpdateStatus::Updated => Ok(()),
        UpdateStatus::NotUpdatedButExists => {
            let record = result.found;

            if record.operating_saga_id == None
                && record.replacement_state
                    == RegionReplacementState::Running
                && record.new_region_id == Some(new_region_id)
                && record.delete_old_region_volume_id // <-------- here
                    == Some(old_region_volume_id) // <-------- and here
            {
```
Can schema.rs be changed too? I don't think so, it seems like it has to map to the actual column name.
Eh, it's fine. Just leave it. No need to go through this hassle right now. #5858 will likely clear up any miscommunication anyway.
```rust
&allocated_regions[0].1;

// Manually insert the replacement request
let request = RegionReplacement {
```
I think it would be nice if there was a minimal `RegionReplacementRequest` type that could be created that didn't require filling in the `None` fields, and possibly even the `replacement_state` field. That not only makes it easier for the caller to write, but makes it easier to read. The other values can be set in a `From` impl.
Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.
> Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.
I kinda figured as much. Definitely no rush. Thanks for opening an issue!
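For reference, the minimal-type idea discussed above could look roughly like this (hypothetical, heavily simplified types, not the actual db-model structs):

```rust
// Hypothetical sketch of the reviewer's suggestion: a minimal "new request"
// type whose From impl fills in the fields that always start out as None.
#[derive(Debug, PartialEq)]
enum ReplacementState {
    Requested,
}

#[derive(Debug)]
struct RegionReplacement {
    old_region_id: u64,
    volume_id: u64,
    old_region_volume_id: Option<u64>,
    new_region_id: Option<u64>,
    replacement_state: ReplacementState,
}

// The minimal set of fields a caller actually has to supply.
struct NewRegionReplacement {
    old_region_id: u64,
    volume_id: u64,
}

impl From<NewRegionReplacement> for RegionReplacement {
    fn from(new: NewRegionReplacement) -> Self {
        RegionReplacement {
            old_region_id: new.old_region_id,
            volume_id: new.volume_id,
            // Fields that always start out empty are set here, not by callers.
            old_region_volume_id: None,
            new_region_id: None,
            replacement_state: ReplacementState::Requested,
        }
    }
}

fn main() {
    let request: RegionReplacement =
        NewRegionReplacement { old_region_id: 1, volume_id: 2 }.into();
    assert_eq!(request.old_region_id, 1);
    assert_eq!(request.replacement_state, ReplacementState::Requested);
    assert!(request.old_region_volume_id.is_none());
}
```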