
Support arbitrary redundancy for region allocation #5346

Merged

Conversation

Contributor

@jmpesp commented Mar 28, 2024

Add a `num_regions_required` argument for region allocation, and a new `arbitrary_region_allocate` function.

Fixes #5119.
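
A toy sketch of the intent, with invented names (the real allocation is a CTE in nexus/db-queries/src/db/queries/region_allocation.rs, and its signature is not reproduced here): the caller passes the total number of regions the volume should end up with, only the missing regions are allocated, and no new region lands on a dataset that already holds one of the volume's regions.

```rust
use std::collections::HashSet;

/// Pick datasets for whatever regions are still missing so the volume ends up
/// with `num_regions_required` regions, each on a distinct dataset.
/// Returns None if the requested redundancy can't be reached.
fn pick_new_datasets(
    candidate_datasets: &[u32],
    datasets_already_used: &HashSet<u32>,
    num_regions_required: usize,
) -> Option<Vec<u32>> {
    let already_have = datasets_already_used.len();
    if already_have >= num_regions_required {
        // Idempotent: the volume already has enough regions.
        return Some(Vec::new());
    }
    let needed = num_regions_required - already_have;
    let picks: Vec<u32> = candidate_datasets
        .iter()
        .copied()
        .filter(|d| !datasets_already_used.contains(d))
        .take(needed)
        .collect();
    // All-or-nothing: fail if the redundancy target can't be met.
    (picks.len() == needed).then_some(picks)
}

fn main() {
    // A volume with regions on datasets 1, 2, and 3; ask for 4 regions total
    // (the region replacement case discussed in the review below).
    let used: HashSet<u32> = [1, 2, 3].into_iter().collect();
    assert_eq!(pick_new_datasets(&[1, 2, 3, 4, 5], &used, 4), Some(vec![4]));
    // If no distinct dataset is left, the whole allocation fails.
    assert_eq!(pick_new_datasets(&[1, 2, 3], &used, 4), None);
}
```
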
@jmpesp requested a review from smklein March 28, 2024 21:50
Collaborator

@smklein left a comment


I have a couple comments / nitpicks, but this structure looks good to me, and the tests make sense!

nexus/tests/integration_tests/disks.rs (3 resolved review comments, outdated)
Comment on lines +154 to +158
/// For region replacement, it's important to allocate the *new* region for
/// a volume while respecting the current region allocation strategy. This
/// requires setting `num_regions_required` to one more than the current
/// level for a volume. If a single region is allocated in isolation this
/// could land on the same dataset as one of the existing volume's regions.
Collaborator

I'm okay with this strategy, but is there a reason we aren't trying the following for region replacement?

  • Delete the "bad" region(s)
  • Re-run allocate

Since allocation strives to be idempotent, and if we adequately marked the old region's disk (or sled) as "expunged", we wouldn't re-allocate to the same sled. And presumably we'd only succeed if we could actually restore to the redundancy count (3).

(I do still think several of your changes in this CTE are essential either way, just wondering about the "go to 4, even though we know one is dead" strategy vs "drop to 2 regions when we know one was expunged, then re-run allocation to get us back to 3")
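
A minimal sketch of that alternative, with invented names (this is not Nexus code): delete the known-bad region first, then re-run an idempotent allocation that skips expunged disks and only succeeds once the redundancy target is restored.

```rust
/// Toy model of "delete the bad region, then re-run allocate": datasets on
/// expunged disks are never chosen, and the call only succeeds if the volume
/// can be brought back up to `redundancy` regions on distinct datasets.
struct Dataset {
    id: u32,
    expunged: bool,
}

fn delete_then_reallocate(
    regions: &mut Vec<u32>, // dataset ids currently holding the volume's regions
    bad_dataset: u32,       // the region on the expunged disk
    candidates: &[Dataset],
    redundancy: usize,      // e.g. 3
) -> Result<(), &'static str> {
    // Step 1: delete the "bad" region, dropping from N to N-1.
    regions.retain(|&d| d != bad_dataset);

    // Step 2: re-run allocation until we're back at the redundancy target.
    for c in candidates {
        if regions.len() == redundancy {
            break;
        }
        if !c.expunged && !regions.contains(&c.id) {
            regions.push(c.id);
        }
    }
    if regions.len() == redundancy {
        Ok(())
    } else {
        Err("not enough healthy datasets to restore redundancy")
    }
}

fn main() {
    let candidates = [
        Dataset { id: 1, expunged: false },
        Dataset { id: 2, expunged: false },
        Dataset { id: 3, expunged: true }, // the expunged disk
        Dataset { id: 4, expunged: false },
    ];
    let mut regions = vec![1, 2, 3];
    // The region on dataset 3 is dropped and replaced by one on dataset 4;
    // it is never re-placed on the expunged dataset.
    assert!(delete_then_reallocate(&mut regions, 3, &candidates, 3).is_ok());
    assert_eq!(regions, vec![1, 2, 4]);
}
```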

Contributor Author

I had imagined needing this CTE to work in a scenario where we had not yet expunged the disk - if a disk is not expunged / completely gone, it may not be completely dead yet, and if we're running a saga it could still unwind to a world with the original 3 regions. I'm not sure this makes sense though. Even if a disk is not completely dead, it may be hosting many Downstairs that are returning errors, and should not be trusted anymore.

Another reason is that going from 3 to 4 is more in line with what is happening to the Volume itself, because we won't be removing the bad region from the Volume's targets, writing that to the db, and then adding a new one: at no point should there be only two regions in the targets, or the Upstairs will panic. This isn't a great reason, and it's possible that assert is a bad one that should be removed regardless of what happens here...

Collaborator

I don't actually think it's bad to allow the option here to go either way, but I think it's reasonable to let us "purge the old (known bad) region, even if we're prevented from allocating a new region, for whatever reason".

Pretending that we have 3 -> 4 regions when we're actually moving from 2 -> 3 just seems a little misleading -- like, the upstairs is actually operating at reduced capacity, but it doesn't know.

Related to this - I don't think we've built any mechanism to "update existing instances" that their regions might have changed, right?

For example:

  • Suppose we have several sleds, each with several disks
  • An instance is provisioned on one of those sleds, using several disks across the rack for regions
  • The disk associated with one of those regions is expunged. The instance, however, is still running.
  • As a result, the "upstairs" is still trying to contact "three regions", even though one is gone.
  • If we go through region replacement, will anyone tell the upstairs to update its view of the world?

Collaborator

(To be clear, I don't view this as a blocker for this PR, but I just want to understand how this handoff is going to work a little better, to make sure I'm vigilant about things that could later become footguns)

Contributor Author

@jmpesp Apr 3, 2024

> I don't actually think it's bad to allow the option here to go either way, but I think it's reasonable to let us "purge the old (known bad) region, even if we're prevented from allocating a new region, for whatever reason".

I've added a test for this in 73073a9.

> I just want to understand how this handoff is going to work a little better, to make sure I'm vigilant about things that could later become footguns

Here's my current mental model of how Nexus directed region replacement works:

  1. a Volume V has an Upstairs that points to regions [A, B, C]
  2. a disk is expunged, which happened to have region C
  3. some time after marking a disk as expunged, a background task wakes up and creates region replacement requests for every region affected by that expungement.
  4. each region replacement request will kick off a "region replacement start" saga, which
    1. allocates a new region record (D) for Volume V
    2. goes through the usual create region flow with one of the non-expunged crucible agents
    3. updates the region replacement request record noting this new region D replaces C
    4. updates Volume V's construction request to use the new region D (so the affected Upstairs will now point to [A, B, D])

If there is a propolis running, it will be sent an `InstanceVcrReplace` request with the new construction request for V. This will kick off live repair. The Instance can still issue IO to the Volume.

If there isn't a propolis running, then V will be attached to a Pantry, and reconciliation will occur. Activation won't return until reconciliation is finished.

Once the live repair or reconciliation is finished, then region C has been fully replaced. The "region replacement finish" saga cleans things up.
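
A minimal sketch of the swap in step 4.4, with invented names and types (the real code rewrites the Volume construction request in Nexus): because the replacement request records which old target the new region takes over, the swap can be done in place and the target list never shrinks.

```rust
/// Swap the expunged region's target for the replacement region's target,
/// in place, so the region set never drops below its original size.
/// Illustrative only; the real code edits the Volume construction request.
fn replace_target(targets: &mut [String], old: &str, new: &str) -> bool {
    match targets.iter().position(|t| t.as_str() == old) {
        Some(i) => {
            targets[i] = new.to_string();
            true
        }
        None => false,
    }
}

fn main() {
    // Step 1: Volume V points at regions [A, B, C].
    let mut targets = vec![
        "region-A".to_string(),
        "region-B".to_string(),
        "region-C".to_string(),
    ];
    // Steps 2-4: C's disk is expunged, D is allocated, and the replacement
    // request records that D replaces C; applying it yields [A, B, D].
    assert!(replace_target(&mut targets, "region-C", "region-D"));
    assert_eq!(targets, ["region-A", "region-B", "region-D"]);
}
```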

Of course, this is a straight line + simplified happy path, and there's more complexity in handling all the things that can happen:

  • instances can be powered off in the middle of a repair
  • instances can migrate
  • instances can be started at any time (e.g. before the Pantry is done reconciling)
  • what if anything (region allocation, live repair, or reconciliation) fails?

I'm pretty sure I have these cases covered (and will of course be extensively testing!)

I think the whole process could be done either way (N -> N+1 -> N, or N -> N-1 -> N). The process above (specifically step 4.1) assumes the N+1 method, but all Nexus needs to maintain is a record of "region C needs to be replaced by region D" (step 4.3) - as long as that record exists, the Volume construction request can be modified correctly even if region C has been deleted.

Actually, I'm not sure this PR is needed if we're always deleting the region straight away, because the allocation CTE would bring us back to N (and N wouldn't have to change). The way I've written the whole region replacement process right now assumes that the old region sticks around in the DB (even if the actual physical disk is gone) until the very end, where it goes through the regular volume delete saga flow. That volume delete call could be pushed up much closer to the beginning of the process.

> Pretending that we have 3 -> 4 regions when we're actually moving from 2 -> 3 just seems a little misleading -- like, the upstairs is actually operating at reduced capacity, but it doesn't know.

Even without any region replacement logic in Nexus, the Upstairs (if it is running) does know it's operating at reduced capacity if one of the downstairs is completely gone. If a Volume isn't attached anywhere, then the quoted statement is true: that Volume has region sets that point to regions that are gone, and this will only be "known" when that Volume is Volume::constructed somewhere.

> Related to this - I don't think we've built any mechanism to "update existing instances" that their regions might have changed, right?

That's coming next haha :)

Collaborator

As we see with your test below - I think that as long as the region allocation CTE "does the right thing" when we go from 2 -> 3 or from 3 -> 4, I'm happy. That gives us flexibility to construct either situation, and it lets us call to the Upstairs with either option.

Thanks for the explanation, this plan seems reasonable to me. Seems like we have the building blocks to update existing instances, we just need to call 'em.

Contributor

Just a drive by comment to thank you for the wonderful description @jmpesp

nexus/db-queries/src/db/queries/region_allocation.rs (resolved review comment, outdated)
Comment on lines +2392 to +2393
// Confirm that a region set can start at N, a region can be deleted, and the
// allocation CTE can bring the redundancy back to N.
Collaborator

Love to see it, this mitigates any fears I had about flexibility from the Nexus side!

@jmpesp enabled auto-merge (squash) April 3, 2024 20:13
@jmpesp merged commit 36f435b into oxidecomputer:main Apr 3, 2024
15 checks passed
@jmpesp deleted the arbitrary_region_allocation_redundancy branch April 3, 2024 21:51