Setup the VMM reservoir in a background task #5124

Merged: 35 commits (ajs/vmm-reservoir-bg-thread into main), Mar 7, 2024

Conversation

@andrewjstone (Contributor):

Fixes the first part of #5121

@andrewjstone (author):

We almost certainly want to test this on real hardware and try provisioning a few instances.

@gjcolombo (Contributor) left a comment:

I read the Matrix discussion that led up to this change, but I'm not sure I understand where it all landed, so forgive me if I missed something: does this change need some extra synchronization in the InstanceManager to ensure that the reservoir is actually allocated before trying to start any instances on this sled? ISTM that absent any other synchronization, you could hit the following race:

  1. Sled agent starts up, creates the reservoir manager
  2. Reservoir manager spawns its background task thread, but for some reason this doesn't get scheduled right away
  3. Sled agent notifies Nexus that this sled is ready to rock and roll
  4. Nexus allocates a new VM to this sled
  5. VM startup fails because the new Propolis sees that there's no reservoir space allocated (because the background thread never got a chance to run)

Assuming this is a real issue, I think one easy way to work around this would be to have the manager include a tokio::sync::watch that indicates whether the reservoir thread has completed its initial reservoir setup; anyone who wants to create an instance can use watch::Receiver::wait_for to make sure the reservoir is properly established before continuing.

Again, forgive my ignorance if there's another plan of record here or if I've overlooked something--just wanted to make sure this scenario was accounted for :)
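
A minimal sketch of the watch-based gating suggested above (assuming a recent tokio with `watch::Receiver::wait_for`); `VmmReservoirManager` and its methods are hypothetical names for illustration, not the PR's actual code:

```rust
use tokio::sync::watch;

/// Hypothetical manager that owns the reservoir background thread.
struct VmmReservoirManager {
    ready_tx: watch::Sender<bool>,
}

impl VmmReservoirManager {
    fn new() -> (Self, watch::Receiver<bool>) {
        // Start as "not ready"; the background thread flips this to true
        // once the initial reservoir allocation completes.
        let (ready_tx, ready_rx) = watch::channel(false);
        (Self { ready_tx }, ready_rx)
    }

    /// Called from the background thread after the initial setup finishes.
    fn mark_ready(&self) {
        let _ = self.ready_tx.send(true);
    }
}

/// Anyone about to start an instance waits here first.
async fn wait_for_reservoir(mut ready_rx: watch::Receiver<bool>) {
    // `wait_for` returns immediately if the value already satisfies the
    // predicate, otherwise it waits for the next change that does.
    let _ = ready_rx.wait_for(|ready| *ready).await;
}
```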

(Five resolved review threads on sled-agent/src/vmm_reservoir.rs.)
@andrewjstone (author):

> I read the Matrix discussion that led up to this change, but I'm not sure I understand where it all landed, so forgive me if I missed something: does this change need some extra synchronization in the InstanceManager to ensure that the reservoir is actually allocated before trying to start any instances on this sled? ISTM that absent any other synchronization, you could hit the following race:
>
>   1. Sled agent starts up, creates the reservoir manager
>   2. Reservoir manager spawns its background task thread, but for some reason this doesn't get scheduled right away
>   3. Sled agent notifies Nexus that this sled is ready to rock and roll
>   4. Nexus allocates a new VM to this sled
>   5. VM startup fails because the new Propolis sees that there's no reservoir space allocated (because the background thread never got a chance to run)
>
> Assuming this is a real issue, I think one easy way to work around this would be to have the manager include a tokio::sync::watch that indicates whether the reservoir thread has completed its initial reservoir setup; anyone who wants to create an instance can use watch::Receiver::wait_for to make sure the reservoir is properly established before continuing.
>
> Again, forgive my ignorance if there's another plan of record here or if I've overlooked something--just wanted to make sure this scenario was accounted for :)

I was actually planning on doing exactly this before opening the code but forgot. I wasn't sure where to use it exactly, but presumably it should be done for all instance launches to keep the semantics we have today, right?

@andrewjstone (author):

Thanks so much for the great reviews @jgallagher @gjcolombo

@gjcolombo (Contributor):

> I wasn't sure where to use it exactly

The reservoir needs to be ready to go no later than the Propolis instance_ensure call in sled_agent::InstanceRunner::propolis_ensure_inner. You can do the check anywhere along the path that leads to that function, but I would at least try to get it into the InstanceRunner logic somewhere if possible, since that's where all the other interesting decisions about when and how to talk to Propolis happen.

> but presumably it should be done for all instance launches to keep the semantics we have today, right?

Yeah, as things currently stand this needs to happen for all attempts to ask Propolis to start a new VM.

For background, today the sled allocation flow works like this:

  • Every sled has a reservoir_ram column in the sled table that gives the total size of its reservoir
  • Every running VMM has a row in sled_resource that identifies what sled it belongs to and how much reservoir space it's using
  • When you ask to start an instance, the start saga atomically selects a sled that can fit the instance and writes a new sled_resource entry for the new VMM. (IIRC, more specifically the query is a CTE that finds all the sleds for which "sled's current reservoir usage" + "new instance's reservoir usage" <= "sled's total reservoir size", picks one at random, and then uses that as the sled ID in a new row in sled_resource.)

So, if you're starting up an instance on a sled, it's because Nexus has already determined that there's going to be space there--there's not really a back-and-forth between Nexus and the sleds to determine who might currently have enough space, whose reservoir is ready, etc.

We could change this over time if we wanted (e.g. have the sleds periodically post their current reservoir sizes so that Nexus will only try to allocate to a sled once it has room; but then you have to make sure that sleds won't shrink the reservoir without explicitly being asked, etc.). But for now it's simplest just to wait for the reservoir to reach the correct size.
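
The selection itself is a single CockroachDB CTE in Nexus, but the predicate it evaluates can be sketched in plain Rust. Only `reservoir_ram` and `sled_resource` are real names from the description above; `SledRow`, `pick_sled`, and the in-memory approach are illustrative only:

```rust
use rand::seq::SliceRandom;

/// Illustrative stand-in for the `sled` table joined with the per-sled sum
/// of reservoir usage recorded in `sled_resource`.
struct SledRow {
    id: uuid::Uuid,
    reservoir_ram: u64,    // total reservoir size recorded for the sled
    reservoir_in_use: u64, // reservoir space already claimed by VMMs on it
}

/// Pick a random sled that can fit `requested` bytes of reservoir space,
/// mirroring "current usage + new usage <= total reservoir size".
fn pick_sled(sleds: &[SledRow], requested: u64) -> Option<uuid::Uuid> {
    let candidates: Vec<&SledRow> = sleds
        .iter()
        .filter(|s| s.reservoir_in_use + requested <= s.reservoir_ram)
        .collect();
    candidates.choose(&mut rand::thread_rng()).map(|s| s.id)
}
```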


One additional case to consider here is what to do if a sled fails to reach the expected reservoir size. If the reservoir is just smaller than expected, then starting the VM might work anyway, so maybe we should try it? But eventually it's probably best to tell Nexus what the actual reservoir size ended up being, so that it doesn't try to send VMs to a sled where we know they're not going to be able to run.

@andrewjstone (author):

> So, if you're starting up an instance on a sled, it's because Nexus has already determined that there's going to be space there--there's not really a back-and-forth between Nexus and the sleds to determine who might currently have enough space, whose reservoir is ready, etc.

This makes sense, and follows how our system works elsewhere. I don't think there's a real need to change it unless I'm missing something else.

> One additional case to consider here is what to do if a sled fails to reach the expected reservoir size. If the reservoir is just smaller than expected, then starting the VM might work anyway, so maybe we should try it? But eventually it's probably best to tell Nexus what the actual reservoir size ended up being, so that it doesn't try to send VMs to a sled where we know they're not going to be able to run.

I definitely thought about this, and then... unceremoniously punted. In the current main code there is only the one-time allocation, and if it fails it returns an error and sled-agent fails to start. With this change we log it, but don't fail to start. Presumably we could panic, but there should be a way to alert Nexus instead. I don't think we are much worse off this way.

The new API that allows changing the size of the reservoir is synchronous, and failures are returned, although it's not currently in use anywhere.

@gjcolombo (Contributor) commented Feb 24, 2024:

> This makes sense, and follows how our system works elsewhere. I don't think there's a real need to change it unless I'm missing something else.

Yep, I think you're right. I was spitballing a bit based on some of the discussion I saw in chat (but didn't clearly say so). We can (and should, I think) keep this behavior for this PR.

> I definitely thought about this, and then... unceremoniously punted. In the current main code there is only the one-time allocation, and if it fails it returns an error and sled-agent fails to start. With this change we log it, but don't fail to start. Presumably we could panic, but there should be a way to alert Nexus instead. I don't think we are much worse off this way.

If I'm reading things right, if you try to start an instance on a sled where this has happened, the start sequence will fail in the propolis_ensure_inner step (Propolis will get ENOSPC back from bhyve when trying to allocate memory for the guest and will return an error to sled agent, and it'll bubble up from there). I'm 90% sure that will cause your start saga to unwind and your start attempt to fail, though with what external error code I'm not at all certain.

To be honest, I think this is materially worse (ETA: in degree, though maybe not in overall pain caused, see below) than just having the sled agent panic. If a sled never registers itself with Nexus, Nexus will never try to schedule an instance there. But if you have a sled that's in this state, and it's a viable allocation target, users might find that their instances randomly fail to start sometimes, that trying to start them again works sometimes (if the Wheel of Sleds lands on a different, non-broken sled), and that at other times the instance never starts (if the only available sleds are broken ones).

That said, if the probability of this is very low, it's probably not worth holding this change for it. But I think we should at least put a TODO(#5121) comment on the "failed to allocate reservoir" code path as a reminder to revisit this (and we should make a note in that issue of how we plan to address this once we have that plan).
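
A hedged sketch of that "log but keep running" path with the suggested TODO(#5121); `set_reservoir_size`, `ReservoirError`, and `size_reservoir` are stand-ins rather than the actual sled-agent API:

```rust
use slog::{error, info, Logger};

/// Stand-in for the real reservoir error type.
#[derive(Debug, thiserror::Error)]
#[error("reservoir allocation failed: {0}")]
struct ReservoirError(String);

/// Stand-in for the call that actually asks bhyve to grow the reservoir.
fn set_reservoir_size(_bytes: u64) -> Result<(), ReservoirError> {
    Ok(())
}

fn size_reservoir(log: &Logger, target_bytes: u64) {
    match set_reservoir_size(target_bytes) {
        Ok(()) => info!(log, "VMM reservoir sized"; "bytes" => target_bytes),
        Err(e) => {
            // TODO(#5121): report the failure (and the actual reservoir
            // size) to Nexus so it stops placing instances on this sled.
            error!(log, "failed to set VMM reservoir size"; "error" => %e);
        }
    }
}
```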

@andrewjstone (author) commented Feb 26, 2024:

> To be honest, I think this is materially worse (ETA: in degree, though maybe not in overall pain caused, see below) than just having the sled agent panic. If a sled never registers itself with Nexus, Nexus will never try to schedule an instance there. But if you have a sled that's in this state, and it's a viable allocation target, users might find that their instances randomly fail to start sometimes, that trying to start them again works sometimes (if the Wheel of Sleds lands on a different, non-broken sled), and that at other times the instance never starts (if the only available sleds are broken ones).

@gjcolombo I thought more about this over the weekend, and it really feels to me like we should separate the desired state of the reservoir from the actual state of the reservoir. There is inherently a TOCTOU between setting it from Nexus and having it reflected in the sled-agent. If we switch to the strategy you mentioned of having sled-agent notify Nexus when the size changes and only allow Nexus to change it, then we would get rid of this issue altogether. It feels like we have to do this anyway if we want to allow changing the reservoir at runtime from Nexus. Otherwise we'd run into the same issue here, where the reservoir allocation could fail and cause the instance start to fail.

If we did things this way, Nexus would be notified at startup that sled-agent has a zero-size reservoir but a default desired state of 80%. When allocation completed, sled-agent would notify Nexus of that fact, or of a failure to allocate. I think it would be useful to encode the failure in CRDB, for lack of a better place right now, but I'm not sure if that would be universally agreed upon. I think this would effectively keep the external client behavior around instance allocation the same as it is today, at the cost of more complexity. I'm more than happy to make this change in a follow-up if we agree.
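
A rough sketch of the desired/actual split being proposed here; the types and fields are illustrative and not what the PR (or any follow-up) actually defines:

```rust
/// What Nexus asks for: a target expressed against the sled's physical RAM.
struct DesiredReservoir {
    percent_of_ram: u8, // e.g. 80
}

/// What sled-agent last reported actually allocating, plus a generation so
/// Nexus can ignore stale or out-of-order reports.
struct ReportedReservoir {
    actual_bytes: u64,
    generation: u64,
}

/// Placement decisions would consult only the reported (actual) state.
fn can_fit(reported: &ReportedReservoir, in_use_bytes: u64, requested: u64) -> bool {
    in_use_bytes + requested <= reported.actual_bytes
}
```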

@andrewjstone (author):

> To be honest, I think this is materially worse (ETA: in degree, though maybe not in overall pain caused, see below) than just having the sled agent panic. If a sled never registers itself with Nexus, Nexus will never try to schedule an instance there. But if you have a sled that's in this state, and it's a viable allocation target, users might find that their instances randomly fail to start sometimes, that trying to start them again works sometimes (if the Wheel of Sleds lands on a different, non-broken sled), and that at other times the instance never starts (if the only available sleds are broken ones).
>
> @gjcolombo I thought more about this over the weekend, and it really feels to me like we should separate the desired state of the reservoir from the actual state of the reservoir. There is inherently a TOCTOU between setting it from Nexus and having it reflected in the sled-agent. If we switch to the strategy you mentioned of having sled-agent notify Nexus when the size changes and only allow Nexus to change it, then we would get rid of this issue altogether. It feels like we have to do this anyway if we want to allow changing the reservoir at runtime from Nexus. Otherwise we'd run into the same issue here, where the reservoir allocation could fail and cause the instance start to fail.
>
> If we did things this way, Nexus would be notified at startup that sled-agent has a zero-size reservoir but a default desired state of 80%. When allocation completed, sled-agent would notify Nexus of that fact, or of a failure to allocate. I think it would be useful to encode the failure in CRDB, for lack of a better place right now, but I'm not sure if that would be universally agreed upon. I think this would effectively keep the external client behavior around instance allocation the same as it is today, at the cost of more complexity. I'm more than happy to make this change in a follow-up if we agree.

Hmm, literally right after I wrote this comment out, I realized the reservoir has to be reallocated on each sled-agent startup. This is a particular problem on sled reboot. Don't we already have the problem after RSS where Nexus knows of the sled and will try to allocate instances to it even if it restarts and the allocation fails? In that case, with our current code, sled-agent will be down due to the failed allocation and will not respond to requests. With the new code we'd get a different level of failure. But Nexus would still try to allocate to that sled in either case due to lack of fault detection machinery.

@gjcolombo (Contributor):

@andrewjstone I agree with both your comments. It definitely makes sense to me to decouple "this is the size Nexus thinks the reservoir should be given what the sled has said about its physical memory capacity" from "this is the last reservoir size sled agent reported" and have instance placement decisions depend only on the latter figure.

> In that case, with our current code, sled-agent will be down due to the failed allocation and will not respond to requests. With the new code we'd get a different level of failure. But Nexus would still try to allocate to that sled in either case due to lack of fault detection machinery.

This seems right to me. In fact I'd say these are both specific cases of a more general problem where Nexus thinks that some sled is a viable host for an instance, but in fact it's not for $REASONS. We should certainly handle this problem more robustly, but I don't think I'd make splitting the desired/actual reservoir states contingent on fixing it, unless I've missed something and splitting those responsibilities will make that problem significantly worse.

@andrewjstone (author):

> @andrewjstone I agree with both your comments. It definitely makes sense to me to decouple "this is the size Nexus thinks the reservoir should be given what the sled has said about its physical memory capacity" from "this is the last reservoir size sled agent reported" and have instance placement decisions depend only on the latter figure.

YAY!!!

> This seems right to me. In fact I'd say these are both specific cases of a more general problem where Nexus thinks that some sled is a viable host for an instance, but in fact it's not for $REASONS. We should certainly handle this problem more robustly, but I don't think I'd make splitting the desired/actual reservoir states contingent on fixing it, unless I've missed something and splitting those responsibilities will make that problem significantly worse.

Totally agree this seems like a more general problem, and the split won't make it any worse. I just wanted to point it out to show that we are already kinda screwed but can improve incrementally and fix what you mentioned.

@smklein (Collaborator) commented Feb 26, 2024:

> I realized the reservoir has to be reallocated on each sled-agent startup. This is a particular problem on sled reboot. Don't we already have the problem after RSS where Nexus knows of the sled and will try to allocate instances to it even if it restarts and the allocation fails?

I think it's possible for sleds to disappear at any time, from Nexus' point of view -- I think the concern I have is: "could this request appear to succeed, but actually fail later because of a condition that we aren't able to overcome (reservoir allocation failure)?"

Namely, if we return "instance launched successfully", but we know we're over-provisioning the reservoir, that seems like it'll just make a large gap between "expected invariant violated" (overprovisioning) and "observed misbehavior" (misbehavior in the guest, due to insufficient memory, maybe).

In contrast, if we fail to put an instance on a sled because the sled is offline, we can at least unwind the stack, give a reason to the user ("sled didn't respond!"), and retry somewhere else. That at least gives an operator a chance to intervene, and IMO doesn't present as a harder-to-understand error later due to violated system constraints.


/// Return a [`tokio::sync::watch::Receiver`] whose value changes when the
/// reservoir size is updated.
#[allow(unused)]
A Collaborator commented on the code above:

This is used now, right?

@@ -1179,6 +1189,20 @@ impl InstanceRunner {
&mut self,
migration_params: Option<InstanceMigrationTargetParams>,
) -> Result<(), Error> {
// Wait for enough allocated reservoir space for at least this VM.
@smklein (Collaborator) commented Feb 26, 2024, on the hunk above:

I'm a little concerned by this, since it feels kinda like a partial solution? Yes, this would be correct if we only provisioned a single VM on this sled, but that case seems unlikely to me.

Should we just wait for the entire reservoir to be "provisioned according to what Nexus expects", for now?

Here's an example of the "wait for a single VM's worth of RAM" case that I think would be bad:

  • Sled Agent boots, starts prepping reservoir (slowly). Assume total reservoir size of 1 TiB.
  • Nexus tries to provision 50 VMs to this sled. Each wants to use 10 GiB of RAM.
  • At the time Nexus sends the "instance ensure" requests to the sled agent, the reservoir is only 1% set-up. There's 10 GiB RAM total in there.
  • This check succeeds!

In this case: we over-provision the sled! The VMs might misbehave, since they're operating with a reservoir that doesn't have enough space.

If we're concerned about Nexus being able to successfully provision instances to sleds quickly -- this can still fail! From a user's perspective, this would look like "I tried to start a VM, Nexus thought it was ready, the sled said no, so my allocation failed". So it's kinda a roll of the dice on "does my request fail because Nexus was too eager" or "does my request succeed but violate system invariants because sled agent is too eager".

Alternatives?

There seem like a few other options here that avoid this issue, but they have other tradeoffs.

Option 1: The Instance Manager refuses to provision VMs until the reservoir is "at least the size that Nexus expected it to be". This means that we'd never overprovision, but we would make it more likely for instance ensure requests to fail during boot. This could be mitigated in a few ways - blocking the instance provision request until the reservoir is sized, using HTTP error codes and a retry loop during Nexus instance provision saga, etc.

Option 2: The "Instance Manager" part of Sled Agent refuses to be online until it gets this sizing information from Nexus, and as a part of that handshake, Nexus updates how it perceives the Sled. The interaction could look like:

  • Sled Agent -> Nexus: I'm booting, what size should my reservoir be?
  • Nexus -> Sled Agent: It should be XXX % of your RAM. I'm going to flag you as booting (temporarily non-provisionable).
  • Sled Agent -> Nexus: My reservoir is ready! (Flag self as "no longer booting" for provisioning).

There are certainly other options in this space too. But I think that we should be very careful to avoid overprovisioning, even in boot-only edge cases, if that's a system property we claim to have.
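
A sketch of what Option 1 might look like on the instance-start path, assuming a hypothetical watch channel carrying the current reservoir size and a Nexus-provided expected size; none of these names come from the PR:

```rust
use std::time::Duration;
use tokio::sync::watch;

/// Block instance starts until the reservoir has reached the size Nexus
/// expects, or give up after a timeout so the request fails visibly.
async fn wait_for_full_reservoir(
    mut size_rx: watch::Receiver<u64>,
    expected_bytes: u64,
    timeout: Duration,
) -> Result<(), &'static str> {
    tokio::time::timeout(timeout, size_rx.wait_for(|size| *size >= expected_bytes))
        .await
        .map_err(|_| "timed out waiting for the VMM reservoir")?
        .map_err(|_| "reservoir manager task went away")?;
    Ok(())
}
```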

@andrewjstone (author) replied:

> I'm a little concerned by this, since it feels kinda like a partial solution? Yes, this would be correct if we only provisioned a single VM on this sled, but that case seems unlikely to me.
>
> Should we just wait for the entire reservoir to be "provisioned according to what Nexus expects", for now?

I treated this check as a proxy for the entire reservoir being ready. The reservoir doesn't report getting partially filled right now, and the size only gets set once at RSS time. However, I fully agree with what you said that this is only a partial solution and Nexus can increase the reservoir size. The old size may still be big enough for each individual new instance, but not big enough for all instances intended to run on the sled. Your option 1 would fix this immediate overprovisioning problem. This is also the reason I proposed splitting the current reservoir allocation into "what Nexus wants the next reservation to be" and "what the sled-agent tells it it has allocated". If the real change in size hasn't reached Nexus yet, then Nexus won't provision any extra VMs. Nexus will always know how much total space it has and what it's capable of provisioning. This is similar to your option 2.

Unfortunately, the above solution that splits between desired and actual states of the reservoir doesn't work when a sled is rebooted, as reservoir allocation may fail on the reboot and Nexus will not know about it until after it tries to allocate a VM and it fails because the reservoir hasn't been allocated. I think this is also a problem with Option 2, as there is a TOCTOU where the sled-agent can say "hey I'm ready", and then crash and restart and fail allocation. All in all, I think I prefer option 1.

A Contributor replied:

FWIW, I don't think sled agent will end up overprovisioning memory for VMs if an instance start request arrives while the reservoir is smaller than Nexus expects: if I understand bhyve's behavior correctly, attempting to start a VM with insufficient memory in the reservoir will fail outright and won't use any non-reservoir memory. The instance has still failed to start, of course, and we have to decide how to deal with that, but I don't think the change as written will overcommit sled memory.

A Collaborator replied:

Gotcha, okay, that's good to know! The problem still exists in the form of "will this instance provisioning request fail in a visible way for a client", but that's good that we wouldn't end up violating that constraint either way.

@andrewjstone (author):

I'm changing a bunch of stuff related to pushing notifications to Nexus, so I'm turning this back into a draft temporarily so I can push up WIP changes.

@andrewjstone andrewjstone marked this pull request as ready for review March 1, 2024 23:54
@andrewjstone andrewjstone marked this pull request as draft March 1, 2024 23:58
@andrewjstone andrewjstone marked this pull request as ready for review March 5, 2024 20:01
@andrewjstone (author):

I believe this PR is ready for a real review now. It still needs testing on real hardware, and maybe also a Nexus test to verify the sled-picking behavior changes. Looking at the code, though, I can't see how it won't do the right thing, and this PR is getting relatively large already.

@andrewjstone (author):

John and I tested this on madrid. We see the generation number getting bumped as expected and the reservoir starting at size 0 and then increasing in value after 2 minutes. Instances deployed to the sled fine after that.

@gjcolombo (Contributor) left a comment:

This looks great--thanks for putting the work in on the notification worker (and for taking the time to discuss it with me on Matrix!). I have a few small nits/questions but LGTM overall.

(Resolved review threads on nexus/db-model/src/sled.rs, nexus/db-queries/src/db/datastore/sled.rs, sled-agent/src/nexus.rs, and sled-agent/src/vmm_reservoir.rs.)
@andrewjstone (author):

A couple of things still to do after my most recent changes before I merge this in. I added a new column to the sled-table, so I need to test again on madrid. Ideally I'd test an upgrade to make sure nothing breaks, followed by the same testing done earlier today.

Also, I changed the code in sled_upsert in Nexus to error on a decommissioned sled, but the code in sled-agent will currently keep retrying and erroring. I need to ensure that, on an error after a Get request for SledAgentInfo, the sled-agent just turns itself into a pumpkin.
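
A loose sketch of that intended behavior; `SledInfo`, `NexusResponse`, `notify_nexus`, and the loop structure are illustrative stand-ins for the real `SledAgentInfo`/`NexusNotifierTask` machinery, not its actual API:

```rust
use std::time::Duration;

/// Stand-in for the payload sled-agent pushes to Nexus; it carries a
/// generation and the current reservoir size.
#[derive(Clone)]
struct SledInfo {
    generation: u64,
    reservoir_size_bytes: u64,
}

enum NexusResponse {
    Ok,
    Decommissioned,
}

/// Stand-in for the Nexus internal API call.
async fn notify_nexus(_info: &SledInfo) -> Result<NexusResponse, ()> {
    Ok(NexusResponse::Ok)
}

/// Simplified notifier loop: keep pushing the latest state, but stop
/// entirely ("turn into a pumpkin") once Nexus says the sled is
/// decommissioned, instead of retrying forever.
async fn notifier_task(mut latest: tokio::sync::watch::Receiver<SledInfo>) {
    loop {
        let info = latest.borrow_and_update().clone();
        match notify_nexus(&info).await {
            Ok(NexusResponse::Decommissioned) => return,
            Ok(NexusResponse::Ok) => {
                // Wait for the next local change (e.g. reservoir resized,
                // generation bumped) before notifying again.
                if latest.changed().await.is_err() {
                    return;
                }
            }
            Err(()) => {
                // Transient failure: back off and retry.
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
}
```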

@andrewjstone (author):

Sled-agent now handles decommissioned sleds in the NexusNotifierTask. Ready for testing again.

@andrewjstone (author):

John and I successfully tested all the changes. Hopefully I'll get this built and committed before yet another merge conflict.

@andrewjstone andrewjstone enabled auto-merge (squash) March 7, 2024 22:41
@andrewjstone andrewjstone merged commit 87c9b13 into main Mar 7, 2024
22 checks passed
@andrewjstone andrewjstone deleted the ajs/vmm-reservoir-bg-thread branch March 7, 2024 22:57