
Support arbitrary redundancy for region allocation #5346

Merged

Conversation

Contributor

@jmpesp commented Mar 28, 2024

Add a `num_regions_required` argument for region allocation, and a new `arbitrary_region_allocate` function.

Fixes #5119.
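
A toy sketch of the intent, with invented names (the real allocation is a CTE in nexus/db-queries/src/db/queries/region_allocation.rs, and its signature is not reproduced here): the caller passes the total number of regions the volume should end up with, only the missing regions are allocated, and no new region lands on a dataset that already holds one of the volume's regions.

```rust
use std::collections::HashSet;

/// Pick datasets for whatever regions are still missing so the volume ends up
/// with `num_regions_required` regions, each on a distinct dataset.
/// Returns None if the requested redundancy can't be reached.
fn pick_new_datasets(
    candidate_datasets: &[u32],
    datasets_already_used: &HashSet<u32>,
    num_regions_required: usize,
) -> Option<Vec<u32>> {
    let already_have = datasets_already_used.len();
    if already_have >= num_regions_required {
        // Idempotent: the volume already has enough regions.
        return Some(Vec::new());
    }
    let needed = num_regions_required - already_have;
    let picks: Vec<u32> = candidate_datasets
        .iter()
        .copied()
        .filter(|d| !datasets_already_used.contains(d))
        .take(needed)
        .collect();
    // All-or-nothing: fail if the redundancy target can't be met.
    (picks.len() == needed).then_some(picks)
}

fn main() {
    // A volume with regions on datasets 1, 2, and 3; ask for 4 regions total
    // (the region replacement case discussed in the review below).
    let used: HashSet<u32> = [1, 2, 3].into_iter().collect();
    assert_eq!(pick_new_datasets(&[1, 2, 3, 4, 5], &used, 4), Some(vec![4]));
    // If no distinct dataset is left, the whole allocation fails.
    assert_eq!(pick_new_datasets(&[1, 2, 3], &used, 4), None);
}
```
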
@jmpesp requested a review from smklein March 28, 2024 21:50
Collaborator

@smklein left a comment


I have a couple comments / nitpicks, but this structure looks good to me, and the tests make sense!

nexus/tests/integration_tests/disks.rs (3 resolved review comments, outdated)
Comment on lines +154 to +158
/// For region replacement, it's important to allocate the *new* region for
/// a volume while respecting the current region allocation strategy. This
/// requires setting `num_regions_required` to one more than the current
/// level for a volume. If a single region is allocated in isolation this
/// could land on the same dataset as one of the existing volume's regions.
Collaborator

I'm okay with this strategy, but is there a reason we aren't trying the following for region replacement?

  • Delete the "bad" region(s)
  • Re-run allocate

Since allocation strives to be idempotent, and if we adequately marked the old region's disk (or sled) as "expunged", we wouldn't re-allocate to the same sled. And presumably we'd only succeed if we could actually restore to the redundancy count (3).

(I do still think several of your changes in this CTE are essential either way, just wondering about the "go to 4, even though we know one is dead" strategy vs "drop to 2 regions when we know one was expunged, then re-run allocation to get us back to 3")
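
A minimal sketch of that alternative, with invented names (this is not Nexus code): delete the known-bad region first, then re-run an idempotent allocation that skips expunged disks and only succeeds once the redundancy target is restored.

```rust
/// Toy model of "delete the bad region, then re-run allocate": datasets on
/// expunged disks are never chosen, and the call only succeeds if the volume
/// can be brought back up to `redundancy` regions on distinct datasets.
struct Dataset {
    id: u32,
    expunged: bool,
}

fn delete_then_reallocate(
    regions: &mut Vec<u32>, // dataset ids currently holding the volume's regions
    bad_dataset: u32,       // the region on the expunged disk
    candidates: &[Dataset],
    redundancy: usize,      // e.g. 3
) -> Result<(), &'static str> {
    // Step 1: delete the "bad" region, dropping from N to N-1.
    regions.retain(|&d| d != bad_dataset);

    // Step 2: re-run allocation until we're back at the redundancy target.
    for c in candidates {
        if regions.len() == redundancy {
            break;
        }
        if !c.expunged && !regions.contains(&c.id) {
            regions.push(c.id);
        }
    }
    if regions.len() == redundancy {
        Ok(())
    } else {
        Err("not enough healthy datasets to restore redundancy")
    }
}

fn main() {
    let candidates = [
        Dataset { id: 1, expunged: false },
        Dataset { id: 2, expunged: false },
        Dataset { id: 3, expunged: true }, // the expunged disk
        Dataset { id: 4, expunged: false },
    ];
    let mut regions = vec![1, 2, 3];
    // The region on dataset 3 is dropped and replaced by one on dataset 4;
    // it is never re-placed on the expunged dataset.
    assert!(delete_then_reallocate(&mut regions, 3, &candidates, 3).is_ok());
    assert_eq!(regions, vec![1, 2, 4]);
}
```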

Contributor Author

I had imagined needing this CTE to work in a scenario where we had not yet expunged the disk - if a disk is not expunged / completely gone, it may not be completely dead yet, and if we're running a saga it could still unwind to a world with the original 3 regions. I'm not sure this makes sense though. Even if a disk is not completely dead, it may be hosting many Downstairs that are returning errors, and should not be trusted anymore.

Another reason is that going from 3 to 4 is more in line with what is happening to the Volume itself, because we won't be removing the bad region from the Volume's targets, writing that to the db, and then adding a new one: at no point should there be only two regions in the targets, or the Upstairs will panic. This isn't a great reason, and it's possible that assert is a bad one that should be removed regardless of what happens here...

Collaborator

I don't actually think it's bad to allow the option here to go either way, but I think it's reasonable to let us "purge the old (known bad) region, even if we're prevented from allocating a new region, for whatever reason".

Pretending that we have 3 -> 4 regions when we're actually moving from 2 -> 3 just seems a little misleading -- like, the upstairs is actually operating at reduced capacity, but it doesn't know.

Related to this - I don't think we've built any mechanism to "update existing instances" that their regions might have changed, right?

For example:

  • Suppose we have several sleds, each with several disks
  • An instance is provisioned on one of those sleds, using several disks across the rack for regions
  • The disk associated with one of those regions is expunged. The instance, however, is still running.
  • As a result, the "upstairs" is still trying to contact "three regions", even though one is gone.
  • If we go through region replacement, will anyone tell the upstairs to update its view of the world?

Collaborator

(To be clear, I don't view this as a blocker for this PR, but I just want to understand how this handoff is going to work a little better, to make sure I'm vigilant about things that could later become footguns)

Contributor Author

@jmpesp Apr 3, 2024

> I don't actually think it's bad to allow the option here to go either way, but I think it's reasonable to let us "purge the old (known bad) region, even if we're prevented from allocating a new region, for whatever reason".

I've added a test for this in 73073a9.

> I just want to understand how this handoff is going to work a little better, to make sure I'm vigilant about things that could later become footguns

Here's my current mental model of how Nexus directed region replacement works:

  1. a Volume V has an Upstairs that points to regions [A, B, C]
  2. a disk is expunged, which happened to have region C
  3. some time after marking a disk as expunged, a background task wakes up and creates region replacement requests for every region affected by that expungement.
  4. each region replacement request will kick off a "region replacement start" saga, which
    1. allocates a new region record (D) for Volume V
    2. goes through the usual create region flow with one of the non-expunged crucible agents
    3. updates the region replacement request record noting this new region D replaces C
    4. updates Volume V's construction request to use the new region D (so the affected Upstairs will now point to [A, B, D])

If there is a propolis running, it will be sent an `InstanceVcrReplace` request with the new construction request for V. This will kick off live repair. The Instance can still issue IO to the Volume.

If there isn't a propolis running, then V will be attached to a Pantry, and reconciliation will occur. Activation won't return until reconciliation is finished.

Once the live repair or reconciliation is finished, then region C has been fully replaced. The "region replacement finish" saga cleans things up.
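
A minimal sketch of the swap in step 4.4, with invented names and types (the real code rewrites the Volume construction request in Nexus): because the replacement request records which old target the new region takes over, the swap can be done in place and the target list never shrinks.

```rust
/// Swap the expunged region's target for the replacement region's target,
/// in place, so the region set never drops below its original size.
/// Illustrative only; the real code edits the Volume construction request.
fn replace_target(targets: &mut [String], old: &str, new: &str) -> bool {
    match targets.iter().position(|t| t.as_str() == old) {
        Some(i) => {
            targets[i] = new.to_string();
            true
        }
        None => false,
    }
}

fn main() {
    // Step 1: Volume V points at regions [A, B, C].
    let mut targets = vec![
        "region-A".to_string(),
        "region-B".to_string(),
        "region-C".to_string(),
    ];
    // Steps 2-4: C's disk is expunged, D is allocated, and the replacement
    // request records that D replaces C; applying it yields [A, B, D].
    assert!(replace_target(&mut targets, "region-C", "region-D"));
    assert_eq!(targets, ["region-A", "region-B", "region-D"]);
}
```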

Of course, this is a straight line + simplified happy path, and there's more complexity in handling all the things that can happen:

  • instances can be powered off in the middle of a repair
  • instances can migrate
  • instances can be started at any time (e.g. before the Pantry is done reconciling)
  • what if anything (region allocation, live repair, or reconciliation) fails?

I'm pretty sure I have these cases covered (and will of course be extensively testing!)

I think the whole process could be done either way (N -> N+1 -> N, or N -> N-1 -> N). The process above (specifically step 4.1) assumes the N+1 method, but all Nexus needs to maintain is a record of "region C needs to be replaced by region D" (step 4.3) - as long as that record exists, the Volume construction request can be modified correctly even if region C has been deleted.

Actually, I'm not sure this PR is needed if we're always deleting the region straight away, because the allocation CTE would bring us back to N (and N wouldn't have to change). The way I've written the whole region replacement process right now assumes that the old region sticks around in the DB (even if the actual physical disk is gone) until the very end, where it goes through the regular volume delete saga flow. That volume delete call could be pushed up much closer to the beginning of the process.

> Pretending that we have 3 -> 4 regions when we're actually moving from 2 -> 3 just seems a little misleading -- like, the upstairs is actually operating at reduced capacity, but it doesn't know.

Even without any region replacement logic in Nexus, the Upstairs (if it is running) does know it's operating at reduced capacity if one of the downstairs is completely gone. If a Volume isn't attached anywhere, then the quoted statement is true: that Volume has region sets that point to regions that are gone, and this will only be "known" when that Volume is Volume::constructed somewhere.

> Related to this - I don't think we've built any mechanism to "update existing instances" that their regions might have changed, right?

That's coming next haha :)

Collaborator

As we see with your test below - I think that as long as the region allocation CTE "does the right thing" when we go from 2 -> 3 or from 3 -> 4, I'm happy. That gives us flexibility to construct either situation, and it lets us call to the Upstairs with either option.

Thanks for the explanation, this plan seems reasonable to me. Seems like we have the building blocks to update existing instances, we just need to call 'em.

Contributor

Just a drive by comment to thank you for the wonderful description @jmpesp

nexus/db-queries/src/db/queries/region_allocation.rs (resolved review comment, outdated)
Comment on lines +2392 to +2393
// Confirm that a region set can start at N, a region can be deleted, and the
// allocation CTE can bring the redundancy back to N.
Collaborator

Love to see it, this mitigates any fears I had about flexibility from the Nexus side!

@jmpesp enabled auto-merge (squash) April 3, 2024 20:13
@jmpesp merged commit 36f435b into oxidecomputer:main Apr 3, 2024
15 checks passed
@jmpesp deleted the arbitrary_region_allocation_redundancy branch April 3, 2024 21:51