[#3886 3/4] Start saga for region replacement #5839

Merged
merged 5 commits into oxidecomputer:main on Jun 6, 2024

Conversation

@jmpesp (Contributor) commented May 30, 2024

Start filling out the region replacement background task skeleton by adding a check for regions that are stored on disks that were expunged, and inserting region replacement requests for them. In the same background task, check if there are any requests in the "Requested" state and run the new "region replacement start" saga for them. This background task will also pick up manually requested region replacements.

Also in this commit is the region replacement start saga, which will take a replacement request that is in state "Requested", and perform the steps to:

  • allocate a new region

  • swap the region being replaced with that newly allocated region in the affected volume

  • create a fake volume that can later be deleted, referencing the region that was swapped out

  • update the region replacement request's state to "Running" and clear the operating saga id

This represents the first step to be taken after a region goes away: allocate the replacement, and swap it into the affected volume. Once this is done, any new checkout and construction of the affected volume will no longer reference the expunged region, but will cause reconciliation to take place. The volume is still degraded in the sense that the newly allocated region is blank and the other two are not, and an Upstairs needs to perform that reconciliation or repair. Existing constructed Volumes running in a propolis or pantry context will remain unmodified: the next commit will add a saga that takes care of initiating live repair or reconciliation for those existing running Volumes.
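In outline, the flow described above looks like this (a sketch based only on the description; step names are illustrative, not the PR's actual saga node names):

    // Background task (sketch):
    //  1. For each region stored on an expunged disk that has no existing
    //     request, insert a region replacement request in state "Requested".
    //  2. For each request in state "Requested" (whether inserted here or
    //     requested manually), run the region replacement start saga.
    //
    // Region replacement start saga (forward actions, in order):
    //  1. Claim the request: set the operating saga id and move it out of
    //     "Requested".
    //  2. Allocate a new region for the affected volume.
    //  3. Swap the old region for the new one in the volume.
    //  4. Create the throwaway volume that references only the old region,
    //     so the volume delete saga can clean it up later.
    //  5. Set the request's state to "Running", record the new region and
    //     throwaway volume ids, and clear the operating saga id.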

@leftwo (Contributor) left a comment

Mostly questions I have to help me understand

}

// Skip looking at read-only parent, this function only looks
// for R/W regions
Contributor

for now...

};

let old_volume = if let Some(old_volume) = maybe_old_volume {
old_volume
Contributor

Any reason you don't do the

let old_volume = match volume_dsl::volume.filter....
    Some(ov) => ov,
    _ => return Err..

pattern here like you do below?

Contributor Author

Not really, though it does reduce the indentation to bail early here.
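For reference, the suggested early-bail shape would look roughly like this (a sketch; maybe_old_volume is the Option produced by the query, and the error variant is the one used in the else branch below):

    let old_volume = match maybe_old_volume {
        Some(old_volume) => old_volume,
        None => {
            // existing volume was deleted, so we can't perform the region
            // replacement
            return Err(err.bail(
                VolumeReplaceRegionError::TargetVolumeDeleted,
            ));
        }
    };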

} else {
// existing volume was deleted, so return an error, we
// can't perform the region replacement now!
return Err(err.bail(VolumeReplaceRegionError::TargetVolumeDeleted));
Contributor

Is the only possible outcome other than Some(old_volume) from the volume_dsl::volume... a
deleted volume?
I guess the err.bail_retryable_or_else() will return error directly if the query fails?

Contributor Author

As I understand it, bail here will stop the transaction retry from looping. The volume being deleted is fatal for the region replacement request's progress so we bail here.
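A minimal sketch of that split, assuming the OptionalError / transaction-retry helpers the surrounding code already uses (the error wrapping inside the closure is a stand-in, not the PR's exact conversion):

    // Inside the retryable transaction closure:
    let maybe_old_volume: Option<Volume> = volume_dsl::volume
        .filter(volume_dsl::id.eq(volume_id))
        .select(Volume::as_select())
        .first_async(&conn)
        .await
        .optional()
        .map_err(|e| {
            // retryable database errors restart the transaction; anything
            // else is surfaced through the OptionalError and fails it
            err.bail_retryable_or_else(e, |e| {
                // stand-in for the PR's actual wrapping of non-retryable
                // errors
                VolumeReplaceRegionError::from(e)
            })
        })?;

    let Some(old_volume) = maybe_old_volume else {
        // the volume being deleted is fatal for this request, so stop the
        // retry loop outright
        return Err(err.bail(VolumeReplaceRegionError::TargetVolumeDeleted));
    };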

return Err(err.bail(VolumeReplaceRegionError::SerdeError(e)));
},
};

Contributor

Is this section just making sure that the replacement did not already happen in another
saga and we don't try to redo work that was already done by someone else?

Contributor Author

This section makes the whole function idempotent, yeah, but not for another saga - only one saga can win the lock that happens at the beginning of the start saga, so it's only guarding against when a node is rerun.
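Concretely, the "lock" is a conditional update along these lines (a sketch patterned on the datastore snippet quoted further down in this review; the exact state transition used when claiming the request is an assumption):

    use db::schema::region_replacement::dsl;

    // Claim the request for this saga: this only succeeds if no other saga
    // currently owns it and it is still waiting to be worked on.
    let updated = diesel::update(dsl::region_replacement)
        .filter(dsl::id.eq(region_replacement_id))
        .filter(dsl::operating_saga_id.is_null())
        .filter(dsl::replacement_state.eq(RegionReplacementState::Requested))
        .set((
            dsl::operating_saga_id.eq(Some(operating_saga_id)),
            dsl::replacement_state.eq(RegionReplacementState::Allocating),
        ))
        .check_if_exists::<RegionReplacement>(region_replacement_id)
        .execute_and_check(&*self.pool_connection_authorized(opctx).await?)
        .await;

    // If the update did not apply, the found record is inspected to see
    // whether this same saga already holds the lock, which is what makes a
    // rerun of the node idempotent; any other owner is an error.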

Ok(v) => v,

Err(e) => {
error!(&log, "error looking for existing region replacement requests for {}: {e}", region.id());
Contributor

Does cargo fmt not try to break this into multiple lines?

Contributor Author

It doesn't!

Contributor

cargo fmt is so lazy when it comes to macros. I'd still manually shrink these lines though. What happens is that one long line prevents cargo fmt from working properly across the rest of the file. I've had to fix this before and it's frustrating.

Contributor Author

yeah, done in 5f5d732
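For reference, the hand-wrapped form is just:

    error!(
        &log,
        "error looking for existing region replacement requests for {}: {e}",
        region.id()
    );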

// Request an additional region for this volume: THRESHOLD + 1 is required
// in order to have the proper redundancy. It's important _not_ to delete
// the existing region first, as (if it's still there) then the Crucible
// agent could reuse the allocated port and cause trouble.
Contributor

This check also makes sure that the new region does not land on any of the sleds that have any of the other existing regions as well, right?

Contributor Author

It does - this uses the existing volume ID and REGION_REDUNDANCY_THRESHOLD + 1 in the allocation function parameters.
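A sketch of what that buys us (allocated_regions and existing_region_ids are illustrative names, not the PR's):

    // The allocation query works at the volume level, so the regions the
    // volume already has (and the datasets/sleds they live on) are part of
    // its input. Asking for REGION_REDUNDANCY_THRESHOLD + 1 regions
    // therefore yields exactly one region the volume did not already have,
    // placed on a distinct dataset.
    let new_region = allocated_regions
        .iter()
        .map(|(_dataset, region)| region)
        .find(|region| !existing_region_ids.contains(&region.id()))
        .expect("one region beyond the existing redundancy was requested");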

}

async fn srrs_alloc_new_region_undo(
sagactx: NexusActionContext,
Contributor

Is there any chance for a very long running region create to cause a problem here?
Like, could we request a region be created and the agent is off trying to do that but
it has not actually created the region, but we unwind this saga and go to delete the
region, but it's not actually created yet?

Like, could that cause an orphaned region to be left behind? I don't think it's possible as
I believe we wait for the region creation to finish before moving forward, here. I also think
that we can request the agent to delete a region and it will delete it (eventually) even if
the creation was in progress when the deletion request for that region came in. Is that
all correct?

Contributor Author

You're correct - the region creation code will wait for a known result (created or otherwise) before proceeding, and so will the region deletion, and therefore the saga will only proceed when a result of a request is known.
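A rough sketch of the "wait for a known result" behaviour being relied on (get_region_state and sleep_with_backoff are hypothetical stand-ins for the real agent client calls, and the states follow the Crucible agent's region states):

    // Poll until the ensured region reaches a terminal state, so the saga
    // action only proceeds once the outcome is known.
    let region_state = loop {
        match get_region_state(&agent_client, &region_id).await? {
            // creation still in progress: wait and poll again
            RegionState::Requested => sleep_with_backoff().await,
            // any terminal state: stop polling
            state => break state,
        }
    };

    if region_state != RegionState::Created {
        anyhow::bail!("region create failed: {:?}", region_state);
    }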

// However, we haven't needed it until now! If the Crucible agent is gone
// (which it will be if the disk is expunged), assume that the region in the
// read/write portion of the volume with the same dataset address (there
// should only be one due to the allocation strategy!) is the old region.
Contributor

Ohh.. I wonder if this might some day trip us up. Like, if we are forced to share an agent for
two downstairs as there is no other place to put it.
Is there any way we could put a check somewhere to avoid shooting ourselves in the foot
with this later? Like, something that verifies that all three regions have different targets?
Or, some other way to start populating the table.

I worry that some distant future we relax the unique agent requirement for regions, and we
forget about the assumption made here and replace the wrong downstairs :(

Ah, reading down below, it looks like we will abort if we can't tell which region to replace.
Maybe that is the best we can do here.

Contributor Author

This is a very good point. It should be follow up work to record the port somewhere! Opened #5846

gen: 0,
opts: CrucibleOpts {
id: new_volume_id,
target: vec![old_region_address.to_string()],
Contributor

This old VCR has just a single target?

Contributor Author

Yes - it is a Volume that will only be used when later passed to the volume delete saga.
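Fleshed out, the throwaway volume looks roughly like this (a sketch; values other than those visible in the context above are assumptions, and CrucibleOpts is assumed to fill its remaining fields with defaults):

    let fake_vcr = VolumeConstructionRequest::Volume {
        id: new_volume_id,
        block_size,
        sub_volumes: vec![VolumeConstructionRequest::Region {
            block_size,
            blocks_per_extent,
            extent_count,
            gen: 0,
            opts: CrucibleOpts {
                id: new_volume_id,
                // the only target is the old region being swapped out, so
                // deleting this volume later deletes exactly that region
                target: vec![old_region_address.to_string()],
                ..Default::default()
            },
        }],
        read_only_parent: None,
    };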

old_region_volume_id,
)
.await
.map_err(ActionError::action_failed)?;
Contributor

If we can't talk to the database (or whatever could cause an error here), will things attempt to
recover at some point?

Contributor Author

From the saga node's perspective, it could either spin until a result is known (retrying within the DataStore function), or fail immediately, unwinding the saga. Though, if Nexus can't talk to CRDB, then unwinding will also fail, and there'll need to be follow up work to address this.

@andrewjstone (Contributor) left a comment

I'm not super familiar with all the details of region allocation. But based on your discussion with Alan, this seems like it's good to merge. I'd let Alan be the final arbiter though.

let old_volume = if let Some(old_volume) = maybe_old_volume {
old_volume
} else {
// existing volume was deleted, so return an error, we
Contributor

Suggested change
// existing volume was deleted, so return an error, we
// Existing volume was deleted, so return an error. We

Contributor Author

done in 85ca4f9

fn region_in_vcr(
vcr: &VolumeConstructionRequest,
region: &SocketAddrV6,
) -> anyhow::Result<bool> {
Contributor

It looks like this function can just return a bool instead of wrapping it in a Result. There are no errors possible here. That would simplify the callers.

Contributor Author

There's unfortunately a let parsed_target: SocketAddrV6 = target.parse()?; that prevents this :(

Contributor

Ah damn, I totally missed that. All good.
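For context, the shape of the function and the parse that forces the Result (a simplified sketch, not the PR's exact implementation):

    use std::net::SocketAddrV6;

    fn region_in_vcr(
        vcr: &VolumeConstructionRequest,
        region: &SocketAddrV6,
    ) -> anyhow::Result<bool> {
        match vcr {
            VolumeConstructionRequest::Volume { sub_volumes, .. } => {
                // skip the read-only parent; this only looks at R/W regions
                for sub_volume in sub_volumes {
                    if region_in_vcr(sub_volume, region)? {
                        return Ok(true);
                    }
                }
                Ok(false)
            }

            VolumeConstructionRequest::Region { opts, .. } => {
                for target in &opts.target {
                    // this parse is what keeps the return type a Result
                    let parsed_target: SocketAddrV6 = target.parse()?;
                    if &parsed_target == region {
                        return Ok(true);
                    }
                }
                Ok(false)
            }

            // file- and URL-backed sub-volumes can't contain a R/W region
            _ => Ok(false),
        }
    }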


})
}
.boxed()
}
}

#[cfg(test)]
mod test {
Contributor

Nice test!

Contributor Author

Thanks! :)

request_time: Utc::now(),
old_region_id: region_to_replace.id(),
volume_id: region_to_replace.volume_id(),
old_region_volume_id: None,
Contributor

This is a nitty nit. Feel free to ignore.

I find it confusing that the old_region_volume_id is None, but volume_id is what's getting replaced. I realize why based on prior discussion, since a fake volume id gets generated and becomes old_region_volume_id. Maybe change the name of old_region_volume_id to faked_old_region_volume_id or something to make this more clear.

Contributor Author

Without changing the schema, this is a little awkward. I think this would work:

diff --git a/nexus/db-model/src/region_replacement.rs b/nexus/db-model/src/region_replacement.rs
index a04710f53..3984581f5 100644
--- a/nexus/db-model/src/region_replacement.rs
+++ b/nexus/db-model/src/region_replacement.rs
@@ -135,7 +135,8 @@ pub struct RegionReplacement {
     pub volume_id: Uuid,
 
     /// A synthetic volume that only is used to later delete the old region
-    pub old_region_volume_id: Option<Uuid>,
+    #[diesel(column_name = old_region_volume_id)]
+    pub delete_old_region_volume_id: Option<Uuid>,
 
     /// The new region that will be used to replace the old one
     pub new_region_id: Option<Uuid>,

but without also changing schema.rs, this would still lead to code referencing both names:

        use db::schema::region_replacement::dsl;
        let updated = diesel::update(dsl::region_replacement)
            .filter(dsl::id.eq(region_replacement_id))
            .filter(dsl::operating_saga_id.eq(operating_saga_id))
            .filter(
                dsl::replacement_state.eq(RegionReplacementState::Allocating),
            )
            .set((
                dsl::replacement_state.eq(RegionReplacementState::Running),
                dsl::old_region_volume_id.eq(Some(old_region_volume_id)),
                dsl::new_region_id.eq(Some(new_region_id)),
                dsl::operating_saga_id.eq(Option::<Uuid>::None),
            ))
            .check_if_exists::<RegionReplacement>(region_replacement_id)
            .execute_and_check(&*self.pool_connection_authorized(opctx).await?)
            .await;
    
        match updated {
            Ok(result) => match result.status {
                UpdateStatus::Updated => Ok(()),
                UpdateStatus::NotUpdatedButExists => {
                    let record = result.found;

                    if record.operating_saga_id == None
                        && record.replacement_state
                            == RegionReplacementState::Running
                        && record.new_region_id == Some(new_region_id)
                        && record.delete_old_region_volume_id // <-------- here
                            == Some(old_region_volume_id) // <-------- and here
                    {

Can schema.rs be changed too? I don't think so, it seems like it has to map to the actual column name.

Contributor

Eh, it's fine. Just leave it. No need to go through this hassle right now. #5858 will likely clear up any miscommunication anyway.

&allocated_regions[0].1;

// Manually insert the replacement request
let request = RegionReplacement {
Contributor

I think it would be nice if there was a minimal RegionReplacementRequest type that could be created that didn't require filling in the None fields and possibly even the replacement_state field. That not only makes it easier for the caller to write, but makes it easier to read. The other values can be set in a From impl.
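Something along these lines would capture the suggestion (a sketch; the constructor name is made up, and the field set is taken from the snippets earlier in this review plus an assumed id column):

    impl RegionReplacement {
        /// Create a request for replacing `old_region_id` in `volume_id`,
        /// with every field the saga fills in later left empty and the
        /// state set to Requested.
        pub fn new_for_region(old_region_id: Uuid, volume_id: Uuid) -> Self {
            Self {
                id: Uuid::new_v4(),
                request_time: Utc::now(),
                old_region_id,
                volume_id,
                old_region_volume_id: None,
                new_region_id: None,
                replacement_state: RegionReplacementState::Requested,
                operating_saga_id: None,
            }
        }
    }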

Contributor Author

Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.

Contributor

> Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.

I kinda figured as much. Definitely no rush. Thanks for opening an issue!

@jmpesp enabled auto-merge (squash) June 6, 2024 19:40
@jmpesp merged commit 6c324c3 into oxidecomputer:main Jun 6, 2024
14 checks passed
@jmpesp deleted the region_replacement_part_3 branch June 7, 2024 14:05