[#3886 3/4] Start saga for region replacement #5839
Conversation
Start filling out the region replacement background task skeleton by adding a check for regions that are stored on disks that were expunged, and inserting region replacement requests for them. In the same background task, check if there are any requests in the "Requested" state and run the new "region replacement start" saga for them. This background task will also pick up manually requested region replacements.

Also in this commit is the region replacement start saga, which will take a replacement request that is in state "Requested", and perform the steps to:

- allocate a new region
- swap the region being replaced with that newly allocated region in the affected volume
- create a fake volume, referencing the region that was swapped out, that can later be deleted
- update the region replacement request's state to "Running" and clear the operating saga id

This represents the first step to be taken after a region goes away: allocate the replacement, and swap it into the affected volume. Once this is done, any new checkout and construction of the affected volume will no longer reference the expunged region, but will cause reconciliation to take place. It is still degraded in the sense that the newly allocated region is blank and the other two are not, and an Upstairs needs to perform that reconciliation or repair. Existing constructed Volumes running in a propolis or pantry context will remain unmodified: the next commit will be a saga that takes care of initiating live repair or reconciliation for those existing running Volumes.
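The four saga steps above can be sketched, very loosely, as an in-memory simulation. All type and function names here are hypothetical stand-ins for illustration, not the actual Nexus types:

```rust
// Hypothetical, heavily simplified sketch of the start saga's forward
// actions on in-memory stand-ins; the real saga operates on database records.

#[derive(Debug, Clone, PartialEq)]
struct Volume {
    targets: Vec<String>, // downstairs addresses
}

#[derive(Debug)]
struct Request {
    state: &'static str,
    old_region_volume_id: Option<u64>,
}

// Swap `old` for `new` in the affected volume, returning a fake volume
// that references only the swapped-out region (for later deletion).
fn replace_region(vol: &mut Volume, old: &str, new: &str) -> Volume {
    for t in vol.targets.iter_mut() {
        if t == old {
            *t = new.to_string();
        }
    }
    Volume { targets: vec![old.to_string()] }
}

fn main() {
    let mut vol =
        Volume { targets: vec!["a".into(), "b".into(), "c".into()] };
    let mut req = Request { state: "Requested", old_region_volume_id: None };

    // 1. allocate a new region (stand-in: the address "d")
    // 2. swap it into the affected volume
    let fake = replace_region(&mut vol, "b", "d");
    // 3. record the fake volume for later deletion
    req.old_region_volume_id = Some(1);
    // 4. move the request to "Running"
    req.state = "Running";

    assert_eq!(vol.targets, vec!["a", "d", "c"]);
    assert_eq!(fake.targets, vec!["b"]);
    assert_eq!(req.state, "Running");
}
```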
Mostly questions I have to help me understand
```rust
// Skip looking at read-only parent, this function only looks
// for R/W regions
```
for now...
```rust
let old_volume = if let Some(old_volume) = maybe_old_volume {
    old_volume
```
Any reason you don't do the

```rust
let old_volume = match volume_dsl::volume.filter(...) {
    Some(ov) => ov,
    _ => return Err(...),
};
```

pattern here like you do below?
Not really, though it does reduce the indentation to bail early here.
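For illustration, the two equivalent shapes under discussion, in isolation (toy functions invented for this sketch, not the actual code):

```rust
// Illustrative only: the two equivalent shapes discussed above.
#[derive(Debug, PartialEq)]
enum Error {
    TargetVolumeDeleted,
}

// if-let with an else arm that bails:
fn via_if_let(maybe_old_volume: Option<String>) -> Result<String, Error> {
    let old_volume = if let Some(old_volume) = maybe_old_volume {
        old_volume
    } else {
        return Err(Error::TargetVolumeDeleted);
    };
    Ok(old_volume)
}

// match with an early return, keeping the happy path less indented:
fn via_match(maybe_old_volume: Option<String>) -> Result<String, Error> {
    let old_volume = match maybe_old_volume {
        Some(ov) => ov,
        None => return Err(Error::TargetVolumeDeleted),
    };
    Ok(old_volume)
}

fn main() {
    assert_eq!(via_if_let(Some("v".to_string())), Ok("v".to_string()));
    assert_eq!(via_match(None), Err(Error::TargetVolumeDeleted));
}
```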
```rust
} else {
    // existing volume was deleted, so return an error, we
    // can't perform the region replacement now!
    return Err(err.bail(VolumeReplaceRegionError::TargetVolumeDeleted));
```
Is the only possible outcome other than `Some(old_volume)` from the `volume_dsl::volume...` a deleted volume? I guess the `err.bail_retryable_or_else()` will return an error directly if the query fails?
As I understand it, `bail` here will stop the transaction retry from looping. The volume being deleted is fatal for the region replacement request's progress, so we bail here.
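A loose sketch of that retryable-versus-bail distinction. The names `TxError` and `run_with_retry` are invented for illustration; the real code uses Nexus's transaction retry helper, not this loop:

```rust
// Hypothetical sketch of the retry-vs-bail distinction.
#[derive(Debug, PartialEq)]
enum TxError {
    Retryable,
    Fatal(&'static str), // what `bail` produces: stop retrying immediately
}

fn run_with_retry<F>(mut attempts: u32, mut op: F) -> Result<u32, &'static str>
where
    F: FnMut() -> Result<u32, TxError>,
{
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(TxError::Retryable) if attempts > 0 => attempts -= 1,
            Err(TxError::Retryable) => return Err("retries exhausted"),
            // A fatal ("bailed") error short-circuits the loop.
            Err(TxError::Fatal(msg)) => return Err(msg),
        }
    }
}

fn main() {
    // A fatal error (e.g. target volume deleted) stops the loop at once.
    let r = run_with_retry(3, || Err(TxError::Fatal("volume deleted")));
    assert_eq!(r, Err("volume deleted"));

    // A retryable error is retried until the operation succeeds.
    let mut calls = 0;
    let r = run_with_retry(3, || {
        calls += 1;
        if calls < 3 { Err(TxError::Retryable) } else { Ok(42) }
    });
    assert_eq!(r, Ok(42));
}
```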
```rust
        return Err(err.bail(VolumeReplaceRegionError::SerdeError(e)));
    },
};
```
Is this section just making sure that the replacement did not already happen in another
saga and we don't try to redo work that was already done by someone else?
This section makes the whole function idempotent, yeah, but not for another saga - only one saga can win the lock that happens at the beginning of the start saga, so it's only guarding against when a node is rerun.
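A toy sketch of that node-rerun idempotency (hypothetical types; the real check is a conditional database update):

```rust
// Hypothetical sketch: a state transition that is safe to re-run when a
// saga node is replayed. If the record is already in the target state with
// the same values, treat it as success instead of failing.
#[derive(Debug, Clone, PartialEq)]
struct RequestRecord {
    state: &'static str,
    new_region_id: Option<u64>,
}

fn transition_to_running(
    rec: &mut RequestRecord,
    new_region_id: u64,
) -> Result<(), &'static str> {
    if rec.state == "Allocating" {
        // Normal forward path.
        rec.state = "Running";
        rec.new_region_id = Some(new_region_id);
        Ok(())
    } else if rec.state == "Running"
        && rec.new_region_id == Some(new_region_id)
    {
        // Already done by a previous run of this node: idempotent success.
        Ok(())
    } else {
        Err("unexpected state")
    }
}

fn main() {
    let mut rec = RequestRecord { state: "Allocating", new_region_id: None };
    assert!(transition_to_running(&mut rec, 7).is_ok());
    // Re-running the same node is a no-op, not an error.
    assert!(transition_to_running(&mut rec, 7).is_ok());
    // A conflicting re-run is rejected.
    assert!(transition_to_running(&mut rec, 8).is_err());
}
```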
```rust
Ok(v) => v,

Err(e) => {
    error!(&log, "error looking for existing region replacement requests for {}: {e}", region.id());
```
Does `cargo fmt` not try to break this into multiple lines?
It doesn't!
`cargo fmt` is so lazy when it comes to macros. I'd still manually shrink these lines though. What happens is that one long line prevents `cargo fmt` from working properly across the rest of the file. I've had to fix this before and it's frustrating.
yeah, done in 5f5d732
```rust
// Request an additional region for this volume: THRESHOLD + 1 is required
// in order to have the proper redundancy. It's important _not_ to delete
// the existing region first, as (if it's still there) then the Crucible
// agent could reuse the allocated port and cause trouble.
```
This check also makes sure that the new region does not land on any of the sleds that have any of the other existing regions as well, right?
It does - this uses the existing volume ID and `REGION_REDUNDANCY_THRESHOLD + 1` in the allocation function parameters.
```rust
}

async fn srrs_alloc_new_region_undo(
    sagactx: NexusActionContext,
```
Is there any chance for a very long running region create to cause a problem here?
Like, could we request a region be created and the agent is off trying to do that but
it has not actually created the region, but we unwind this saga and go to delete the
region, but it's not actually created yet?
Like, could that cause an orphaned region to be left behind? I don't think it's possible as
I believe we wait for the region creation to finish before moving forward, here. I also think
that we can request the agent to delete a region and it will delete it (eventually) even if
the creation was in progress when the deletion request for that region came in. Is that
all correct?
You're correct - the region creation code will wait for a known result (created or otherwise) before proceeding, and so will the region deletion, and therefore the saga will only proceed when a result of a request is known.
```rust
// However, we haven't needed it until now! If the Crucible agent is gone
// (which it will be if the disk is expunged), assume that the region in the
// read/write portion of the volume with the same dataset address (there
// should only be one due to the allocation strategy!) is the old region.
```
Ohh.. I wonder if this might some day trip us up. Like, if we are forced to share an agent for
two downstairs as there is no other place to put it.
Is there any way we could put a check somewhere to avoid shooting ourselves in the foot
with this later? Like, something that verifies that all three regions have different targets?
Or, some other way to start populating the table.
I worry that in some distant future we relax the unique agent requirement for regions, and we
forget about the assumption made here and replace the wrong downstairs :(
Ah, reading down below, it looks like we will abort if we can't tell which region to replace.
Maybe that is the best we can do here.
This is a very good point. It should be follow up work to record the port somewhere! Opened #5846
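A guard along the lines the reviewer suggests could be as simple as the following (hypothetical helper, not part of the PR):

```rust
// Hypothetical helper: verify that a volume's regions all have distinct
// dataset addresses before using an address to identify "the" old region.
use std::collections::HashSet;
use std::net::SocketAddrV6;

fn all_targets_distinct(targets: &[SocketAddrV6]) -> bool {
    let unique: HashSet<&SocketAddrV6> = targets.iter().collect();
    unique.len() == targets.len()
}

fn main() {
    let a: SocketAddrV6 = "[fd00::1]:19000".parse().unwrap();
    let b: SocketAddrV6 = "[fd00::2]:19000".parse().unwrap();
    let c: SocketAddrV6 = "[fd00::3]:19000".parse().unwrap();
    assert!(all_targets_distinct(&[a, b, c]));
    // Two regions behind the same agent address would defeat the assumption.
    assert!(!all_targets_distinct(&[a, b, b]));
}
```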
```rust
gen: 0,
opts: CrucibleOpts {
    id: new_volume_id,
    target: vec![old_region_address.to_string()],
```
This old VCR has just a single target?
Yes - it is a Volume that will only be used when later passed to the volume delete saga.
```rust
    old_region_volume_id,
)
.await
.map_err(ActionError::action_failed)?;
```
If we can't talk to the database (or whatever could cause an error here), will things attempt to
recover at some point?
From the saga node's perspective, it could either spin until a result is known (retrying within the DataStore function), or fail immediately, unwinding the saga. Though, if Nexus can't talk to CRDB, then unwinding will also fail, and there'll need to be follow up work to address this.
I'm not super familiar with all the details of region allocation. But based on your discussion with Alan, this seems like it's good to merge. I'd let Alan be the final arbiter though.
```rust
let old_volume = if let Some(old_volume) = maybe_old_volume {
    old_volume
} else {
    // existing volume was deleted, so return an error, we
```
```diff
- // existing volume was deleted, so return an error, we
+ // Existing volume was deleted, so return an error. We
```
done in 85ca4f9
```rust
fn region_in_vcr(
    vcr: &VolumeConstructionRequest,
    region: &SocketAddrV6,
) -> anyhow::Result<bool> {
```
It looks like this function can just return a `bool` instead of wrapping it in a `Result`. There are no errors possible here. That would simplify the callers.
There's unfortunately a `let parsed_target: SocketAddrV6 = target.parse()?;` that prevents this :(
Ah damn, I totally missed that. All good.
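A stripped-down illustration of why that `?` forces a `Result` return (a hypothetical reduction, not the actual function):

```rust
// Illustrative reduction: the membership check must return a Result because
// each stored target string has to be parsed before it can be compared.
use std::net::{AddrParseError, SocketAddrV6};

fn region_in_targets(
    targets: &[&str],
    region: &SocketAddrV6,
) -> Result<bool, AddrParseError> {
    for target in targets {
        // This `?` is what prevents a plain `bool` return type.
        let parsed_target: SocketAddrV6 = target.parse()?;
        if parsed_target == *region {
            return Ok(true);
        }
    }
    Ok(false)
}

fn main() {
    let region: SocketAddrV6 = "[fd00::1]:19000".parse().unwrap();
    assert!(region_in_targets(&["[fd00::1]:19000"], &region).unwrap());
    assert!(!region_in_targets(&["[fd00::2]:19000"], &region).unwrap());
    // A malformed target surfaces as an error rather than a bool.
    assert!(region_in_targets(&["not-an-address"], &region).is_err());
}
```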
```rust
        })
    }
    .boxed()
}

#[cfg(test)]
mod test {
```
Nice test!
Thanks! :)
```rust
request_time: Utc::now(),
old_region_id: region_to_replace.id(),
volume_id: region_to_replace.volume_id(),
old_region_volume_id: None,
```
This is a nitty nit. Feel free to ignore.

I find it confusing that the `old_region_volume_id` is `None`, but `volume_id` is what's getting replaced. I realize why based on prior discussion, since a fake volume id gets generated and becomes `old_region_volume_id`. Maybe change the name of `old_region_volume_id` to `faked_old_region_volume_id` or something to make this more clear.
Without changing the schema, this is a little awkward. I think this would work:
```diff
diff --git a/nexus/db-model/src/region_replacement.rs b/nexus/db-model/src/region_replacement.rs
index a04710f53..3984581f5 100644
--- a/nexus/db-model/src/region_replacement.rs
+++ b/nexus/db-model/src/region_replacement.rs
@@ -135,7 +135,8 @@ pub struct RegionReplacement {
     pub volume_id: Uuid,

     /// A synthetic volume that only is used to later delete the old region
-    pub old_region_volume_id: Option<Uuid>,
+    #[diesel(column_name = old_region_volume_id)]
+    pub delete_old_region_volume_id: Option<Uuid>,

     /// The new region that will be used to replace the old one
     pub new_region_id: Option<Uuid>,
```
but without changing `schema.rs` as well, this would still lead to code referencing both:
```rust
use db::schema::region_replacement::dsl;

let updated = diesel::update(dsl::region_replacement)
    .filter(dsl::id.eq(region_replacement_id))
    .filter(dsl::operating_saga_id.eq(operating_saga_id))
    .filter(
        dsl::replacement_state.eq(RegionReplacementState::Allocating),
    )
    .set((
        dsl::replacement_state.eq(RegionReplacementState::Running),
        dsl::old_region_volume_id.eq(Some(old_region_volume_id)),
        dsl::new_region_id.eq(Some(new_region_id)),
        dsl::operating_saga_id.eq(Option::<Uuid>::None),
    ))
    .check_if_exists::<RegionReplacement>(region_replacement_id)
    .execute_and_check(&*self.pool_connection_authorized(opctx).await?)
    .await;

match updated {
    Ok(result) => match result.status {
        UpdateStatus::Updated => Ok(()),
        UpdateStatus::NotUpdatedButExists => {
            let record = result.found;

            if record.operating_saga_id == None
                && record.replacement_state
                    == RegionReplacementState::Running
                && record.new_region_id == Some(new_region_id)
                && record.delete_old_region_volume_id // <-------- here
                    == Some(old_region_volume_id) // <-------- and here
            {
```
Can schema.rs be changed too? I don't think so, it seems like it has to map to the actual column name.
Eh, it's fine. Just leave it. No need to go through this hassle right now. #5858 will likely clear up any miscommunication anyway.
```rust
&allocated_regions[0].1;

// Manually insert the replacement request
let request = RegionReplacement {
```
I think it would be nice if there was a minimal `RegionReplacementRequest` type that could be created that didn't require filling in the `None` fields, and possibly even the `replacement_state` field. That not only makes it easier for the caller to write, but makes it easier to read. The other values can be set in a `From` impl.
Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.
> Yeah, I've thought a bit about this. I'd like to do this work in a follow up PR though, as this would murder the other commits stacked on top of this one! I've opened #5858 and will follow up afterwards.
I kinda figured as much. Definitely no rush. Thanks for opening an issue!
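For reference, the minimal-type idea discussed above could look roughly like this (hypothetical, heavily simplified types, not the actual db-model structs):

```rust
// Hypothetical sketch of the reviewer's suggestion: a minimal "new request"
// type whose From impl fills in the fields that always start out as None.
#[derive(Debug, PartialEq)]
enum ReplacementState {
    Requested,
}

#[derive(Debug)]
struct RegionReplacement {
    old_region_id: u64,
    volume_id: u64,
    old_region_volume_id: Option<u64>,
    new_region_id: Option<u64>,
    replacement_state: ReplacementState,
}

// The minimal set of fields a caller actually has to supply.
struct NewRegionReplacement {
    old_region_id: u64,
    volume_id: u64,
}

impl From<NewRegionReplacement> for RegionReplacement {
    fn from(new: NewRegionReplacement) -> Self {
        RegionReplacement {
            old_region_id: new.old_region_id,
            volume_id: new.volume_id,
            // Fields that always start out empty are set here, not by callers.
            old_region_volume_id: None,
            new_region_id: None,
            replacement_state: ReplacementState::Requested,
        }
    }
}

fn main() {
    let request: RegionReplacement =
        NewRegionReplacement { old_region_id: 1, volume_id: 2 }.into();
    assert_eq!(request.old_region_id, 1);
    assert_eq!(request.replacement_state, ReplacementState::Requested);
    assert!(request.old_region_volume_id.is_none());
}
```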