Setup the VMM reservoir in a background task #5124

Merged on Mar 7, 2024 · 35 commits

Changes from 4 commits

Commits
1d0ed8c
Setup the VMM reservoir in a background task
andrewjstone Feb 22, 2024
bbbe079
comments
andrewjstone Feb 22, 2024
bdea2f6
review fixes
andrewjstone Feb 23, 2024
710b0ce
Wait for VMM reservoir allocation when ensuring propolis instances
andrewjstone Feb 26, 2024
233f2d1
Revert "Wait for VMM reservoir allocation when ensuring propolis inst…
andrewjstone Feb 27, 2024
bedf0c1
remove watch from vmm_reservoir
andrewjstone Feb 27, 2024
7bdf2e3
Add sled_agent_get to nexus internal API
andrewjstone Feb 27, 2024
7c4a2de
wip
andrewjstone Feb 28, 2024
58a84fa
wip
andrewjstone Feb 28, 2024
f6116b6
wip
andrewjstone Feb 28, 2024
cf185cf
wip
andrewjstone Feb 28, 2024
7dbcedc
wip
andrewjstone Feb 29, 2024
7429a8d
wip
andrewjstone Feb 29, 2024
c6a396c
wip
andrewjstone Feb 29, 2024
32cd05b
wip
andrewjstone Feb 29, 2024
e7ce5af
wip
andrewjstone Feb 29, 2024
00d248e
wip
andrewjstone Feb 29, 2024
adc29dc
Solid test for sled-agent
andrewjstone Feb 29, 2024
3418a60
Make sled upserts conditional on rcgen
andrewjstone Mar 1, 2024
0488749
extra rcgen check
andrewjstone Mar 1, 2024
8a7d297
plumb through reservoir size changes
andrewjstone Mar 4, 2024
f379afb
typo
andrewjstone Mar 4, 2024
9c130e2
Merge branch 'main' into ajs/vmm-reservoir-bg-thread
andrewjstone Mar 4, 2024
f49208e
Return an result for sled upserts, with error for decommissioned
andrewjstone Mar 5, 2024
5b1914b
fix warnings
andrewjstone Mar 5, 2024
e5795c9
clippy
andrewjstone Mar 5, 2024
8d1ae0c
Merge branch 'main' into ajs/vmm-reservoir-bg-thread
andrewjstone Mar 5, 2024
0f904f4
Give sled-agent its own generation number
andrewjstone Mar 7, 2024
4fd1afd
review fixes
andrewjstone Mar 7, 2024
edbf0a0
Merge branch 'main' into ajs/vmm-reservoir-bg-thread
andrewjstone Mar 7, 2024
5b01ef0
Handle sled decommissioning
andrewjstone Mar 7, 2024
332832e
Merge branch 'main' into ajs/vmm-reservoir-bg-thread
andrewjstone Mar 7, 2024
e9167a8
type fix for Baseboard
andrewjstone Mar 7, 2024
c611a81
COMMENTS
andrewjstone Mar 7, 2024
9027e5b
Merge branch 'main' into ajs/vmm-reservoir-bg-thread
andrewjstone Mar 7, 2024
1 change: 1 addition & 0 deletions Cargo.lock

Generated file; diff not rendered by default.

1 change: 1 addition & 0 deletions sled-agent/Cargo.toml
@@ -28,6 +28,7 @@ dpd-client.workspace = true
display-error-chain.workspace = true
dropshot.workspace = true
flate2.workspace = true
flume.workspace = true
futures.workspace = true
glob.workspace = true
hex.workspace = true
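The only dependency added here is flume. Presumably it backs the new background reservoir code, since flume channels have both blocking and async ends and make it easy for async code to hand slow work to a dedicated OS thread. The sketch below shows that general pattern only; the request enum, the oneshot reply, and the sizes are invented for illustration and are not taken from this PR.

```rust
// Sketch of the pattern flume enables: async code handing blocking work
// (such as a reservoir resize) to a dedicated OS thread. Everything named
// here is hypothetical and not part of the PR.
use std::thread;

enum ReservoirRequest {
    SetSize {
        mib: u64,
        reply: tokio::sync::oneshot::Sender<Result<(), String>>,
    },
}

#[tokio::main]
async fn main() {
    let (tx, rx) = flume::unbounded::<ReservoirRequest>();

    // Dedicated worker thread: the blocking operation stays off the tokio runtime.
    thread::spawn(move || {
        while let Ok(req) = rx.recv() {
            match req {
                ReservoirRequest::SetSize { mib, reply } => {
                    // A real worker would perform the slow reservoir resize here.
                    println!("pretending to grow the reservoir to {mib} MiB");
                    let _ = reply.send(Ok(()));
                }
            }
        }
    });

    // Async side: send a request and await the reply without blocking a task.
    let (reply_tx, reply_rx) = tokio::sync::oneshot::channel();
    tx.send_async(ReservoirRequest::SetSize { mib: 8192, reply: reply_tx })
        .await
        .expect("reservoir worker exited");
    reply_rx
        .await
        .expect("worker dropped the reply")
        .expect("resize failed");
}
```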
26 changes: 25 additions & 1 deletion sled-agent/src/instance.rs
@@ -33,6 +33,7 @@ use illumos_utils::svc::wait_for_service;
use illumos_utils::zone::Zones;
use illumos_utils::zone::PROPOLIS_ZONE_PREFIX;
use omicron_common::address::NEXUS_INTERNAL_PORT;
use omicron_common::api::external::ByteCount;
use omicron_common::api::internal::nexus::{
InstanceRuntimeState, SledInstanceState, VmmRuntimeState,
};
@@ -49,7 +50,7 @@ use slog::Logger;
use std::net::IpAddr;
use std::net::{SocketAddr, SocketAddrV6};
use std::sync::Arc;
use tokio::sync::{mpsc, oneshot};
use tokio::sync::{mpsc, oneshot, watch};
use uuid::Uuid;

// The depth of the request queue for the instance.
@@ -125,6 +126,9 @@ pub enum Error {

#[error("Instance dropped our request")]
RequestDropped(#[from] oneshot::error::RecvError),

#[error("VMM reservoir watch dropped")]
VmmReservoirWatchDropped(#[from] watch::error::RecvError),
}

// Issues read-only, idempotent HTTP requests at propolis until it responds with
@@ -354,6 +358,10 @@ struct InstanceRunner {

// Object representing membership in the "instance manager".
instance_ticket: InstanceTicket,

// Used to ensure the reservoir is allocated with enough capacity
// for starting an instance.
vmm_reservoir_watch: watch::Receiver<Option<ByteCount>>,
}

impl InstanceRunner {
@@ -948,6 +956,7 @@ impl Instance {
storage,
zone_bundler,
zone_builder_factory,
vmm_reservoir_watch,
} = services;

let mut dhcp_config = DhcpCfg {
@@ -1028,6 +1037,7 @@ impl Instance {
zone_builder_factory,
zone_bundler,
instance_ticket: ticket,
vmm_reservoir_watch,
};

let runner_handle =
@@ -1179,6 +1189,20 @@
&mut self,
migration_params: Option<InstanceMigrationTargetParams>,
) -> Result<(), Error> {
// Wait for enough allocated reservoir space for at least this VM.
@smklein (Collaborator) commented on Feb 26, 2024:

I'm a little concerned by this, since it feels kinda like a partial solution? Yes, this would be correct if we only provisioned a single VM on this sled, but that case seems unlikely to me.

Should we just wait for the entire reservoir to be "provisioned according to what Nexus expects", for now?

An example where this "wait for a single VM's worth of RAM" check would be bad:

  • Sled Agent boots, starts prepping reservoir (slowly). Assume total reservoir size of 1 TiB.
  • Nexus tries to provision 50 VMs to this sled. Each wants to use 10 GiB of RAM.
  • At the time Nexus sends the "instance ensure" requests to the sled agent, the reservoir is only 1% set-up. There's 10 GiB RAM total in there.
  • This check succeeds!

In this case: we over-provision the sled! The VMs might misbehave, since they're operating with a reservoir that doesn't have enough space.

If we're concerned about Nexus being able to successfully provision instances to sleds quickly -- this can still fail! From a user's perspective, this would look like "I tried to start a VM, Nexus thought it was ready, the sled said no, so my allocation failed". So it's kinda a roll of the dice between "does my request fail because Nexus was too eager" and "does my request succeed but violate system invariants because sled agent is too eager".

Alternatives?

There seem to be a few other options here that avoid this issue, but they have other tradeoffs.

Option 1: The Instance Manager refuses to provision VMs until the reservoir is "at least the size that Nexus expected it to be". This means that we'd never overprovision, but we would make it more likely for instance ensure requests to fail during boot. This could be mitigated in a few ways - blocking the instance provision request until the reservoir is sized, using HTTP error codes and a retry loop during Nexus instance provision saga, etc.

Option 2: The "Instance Manager" part of Sled Agent refuses to be online until it gets this sizing information from Nexus, and as a part of that handshake, Nexus updates how it perceives the Sled. The interaction could look like:

  • Sled Agent -> Nexus: I'm booting, what size should my reservoir be?
  • Nexus -> Sled Agent: It should be XXX % of your RAM. I'm going to flag you as booting (temporarily non-provisionable).
  • Sled Agent -> Nexus: My reservoir is ready! (Flag self as "no longer booting" for provisioning).

There are certainly other options in this space too. But I think that we should be very careful to avoid overprovisioning, even in boot-only edge cases, if that's a system property we claim to have.

@andrewjstone (Contributor, Author) replied:

> I'm a little concerned by this, since it feels kinda like a partial solution? Yes, this would be correct if we only provisioned a single VM on this sled, but that case seems unlikely to me.
>
> Should we just wait for the entire reservoir to be "provisioned according to what Nexus expects", for now?

I treated this check as a proxy for the entire reservoir being ready. The reservoir doesn't report being partially filled right now, and the size only gets set once at RSS time. However, I fully agree with what you said: this is only a partial solution, and Nexus can increase the reservoir size. The old size may still be big enough for each individual new instance, but not big enough for all the instances intended to run on the sled. Your option 1 would fix this immediate overprovisioning problem. This is also the reason I proposed splitting the current reservoir allocation into "what Nexus wants the next reservation to be" and "what the sled-agent reports it has actually allocated". If the real change in size hasn't reached Nexus yet, then Nexus won't provision any extra VMs. Nexus will always know how much total space it has and what it's capable of provisioning. This is similar to your option 2.

Unfortunately, the above solution that splits the desired and actual states of the reservoir doesn't work when a sled is rebooted: reservoir allocation may fail on the reboot, and Nexus will not know about it until it tries to place a VM and the request fails because the reservoir hasn't been allocated. I think this is also a problem with Option 2, as there is a TOCTOU where the sled-agent can say "hey, I'm ready", then crash, restart, and fail allocation. All in all, I think I prefer option 1.

A contributor replied:

FWIW, I don't think sled agent will end up overprovisioning memory for VMs if an instance start request arrives while the reservoir is smaller than Nexus expects: if I understand bhyve's behavior correctly, attempting to start a VM with insufficient memory in the reservoir will fail outright and won't use any non-reservoir memory. The instance has still failed to start, of course, and we have to decide how to deal with that, but I don't think the change as written will overcommit sled memory.

A collaborator replied:

Gotcha, okay, that's good to know! The problem still exists in the form of "will this instance provisioning request fail in a visible way for a client", but that's good that we wouldn't end up violating that constraint either way.

let mem_needed = self.properties.memory;
let val_ref = self
.vmm_reservoir_watch
.wait_for(|val| {
val.map_or(false, |size| {
size.to_whole_mebibytes() >= mem_needed
})
})
.await?;
// Drop the ref so we don't hold the lock on the watch and block
// the producer.
drop(val_ref);

if let Some(running_state) = self.running_state.as_ref() {
info!(
&self.log,
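The `wait_for` call in the hunk above is the heart of the approach the reviewers were debating: instance startup parks on a `tokio::sync::watch` channel until the published reservoir size covers the VM's memory. The standalone sketch below reproduces that pattern with made-up types and sizes (a plain MiB count instead of `ByteCount`). It also shows why the returned `Ref` is dropped promptly: it holds the watch's internal read lock, and keeping it alive would block the sender's next update.

```rust
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // Stand-in for the reservoir size published by a background task,
    // in MiB (the real channel carries an Option<ByteCount>).
    let (tx, mut rx) = watch::channel::<Option<u64>>(None);

    // Simulate the background allocation finishing some time later.
    tokio::spawn(async move {
        tokio::time::sleep(std::time::Duration::from_millis(50)).await;
        let _ = tx.send(Some(8192));
    });

    // Park until the published size covers what this hypothetical VM needs.
    let needed_mib = 4096;
    let seen = rx
        .wait_for(|size| size.map_or(false, |mib| mib >= needed_mib))
        .await
        .expect("reservoir publisher dropped");

    // `seen` is a watch::Ref holding the channel's read lock; drop it
    // promptly so the producer's next send() is not blocked.
    println!("reservoir ready: {:?} MiB", *seen);
    drop(seen);
}
```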
118 changes: 17 additions & 101 deletions sled-agent/src/instance_manager.rs
@@ -13,24 +13,25 @@ use crate::params::{
InstanceHardware, InstanceMigrationSourceParams, InstancePutStateResponse,
InstanceStateRequested, InstanceUnregisterResponse,
};
use crate::vmm_reservoir::VmmReservoirManagerHandle;
use crate::zone_bundle::BundleError;
use crate::zone_bundle::ZoneBundler;
use omicron_common::api::external::ByteCount;

use anyhow::anyhow;
use illumos_utils::dladm::Etherstub;
use illumos_utils::link::VnicAllocator;
use illumos_utils::opte::PortManager;
use illumos_utils::running_zone::ZoneBuilderFactory;
use illumos_utils::vmm_reservoir;
use omicron_common::api::external::ByteCount;
use omicron_common::api::internal::nexus::InstanceRuntimeState;
use omicron_common::api::internal::nexus::SledInstanceState;
use omicron_common::api::internal::nexus::VmmRuntimeState;
use sled_storage::manager::StorageHandle;
use slog::Logger;
use std::collections::BTreeMap;
use std::net::SocketAddr;
use std::sync::{Arc, Mutex};
use std::sync::Arc;
use tokio::sync::watch;
use tokio::sync::{mpsc, oneshot};
use uuid::Uuid;

@@ -48,12 +49,6 @@ pub enum Error {
#[error("OPTE port management error: {0}")]
Opte(#[from] illumos_utils::opte::Error),

#[error("Failed to create reservoir: {0}")]
Reservoir(#[from] vmm_reservoir::Error),

#[error("Invalid reservoir configuration: {0}")]
ReservoirConfig(String),

#[error("Cannot find data link: {0}")]
Underlay(#[from] sled_hardware::underlay::Error),

@@ -72,32 +67,21 @@
RequestDropped(#[from] oneshot::error::RecvError),
}

pub enum ReservoirMode {
None,
Size(u32),
Percentage(u8),
}

pub(crate) struct InstanceManagerServices {
pub nexus_client: NexusClientWithResolver,
pub vnic_allocator: VnicAllocator<Etherstub>,
pub port_manager: PortManager,
pub storage: StorageHandle,
pub zone_bundler: ZoneBundler,
pub zone_builder_factory: ZoneBuilderFactory,
pub vmm_reservoir_watch: watch::Receiver<Option<ByteCount>>,
}

// Describes the internals of the "InstanceManager", though most of the
// instance manager's state exists within the "InstanceManagerRunner" structure.
struct InstanceManagerInternal {
log: Logger,
tx: mpsc::Sender<InstanceManagerRequest>,
// NOTE: Arguably, this field could be "owned" by the InstanceManagerRunner.
// It was not moved there, and the reservoir functions were not converted to
// use the message-passing interface (see: "InstanceManagerRequest") because
// callers of "get/set reservoir size" are not async, and (in the case of
// getting the size) they also do not expect a "Result" type.
reservoir_size: Mutex<ByteCount>,
vmm_reservoir_manager: VmmReservoirManagerHandle,

#[allow(dead_code)]
runner_handle: tokio::task::JoinHandle<()>,
@@ -110,6 +94,7 @@ pub struct InstanceManager {

impl InstanceManager {
/// Initializes a new [`InstanceManager`] object.
#[allow(clippy::too_many_arguments)]
pub fn new(
log: Logger,
nexus_client: NexusClientWithResolver,
@@ -118,6 +103,7 @@
storage: StorageHandle,
zone_bundler: ZoneBundler,
zone_builder_factory: ZoneBuilderFactory,
vmm_reservoir_manager: VmmReservoirManagerHandle,
) -> Result<InstanceManager, Error> {
let (tx, rx) = mpsc::channel(QUEUE_SIZE);
let (terminate_tx, terminate_rx) = mpsc::unbounded_channel();
@@ -135,98 +121,21 @@
storage,
zone_bundler,
zone_builder_factory,
vmm_reservoir_watch: vmm_reservoir_manager.watcher(),
};

let runner_handle =
tokio::task::spawn(async move { runner.run().await });

Ok(Self {
inner: Arc::new(InstanceManagerInternal {
log,
tx,
// no reservoir size set on startup
reservoir_size: Mutex::new(ByteCount::from_kibibytes_u32(0)),
vmm_reservoir_manager,
runner_handle,
}),
})
}

/// Sets the VMM reservoir to the requested percentage of usable physical
/// RAM or to a size in MiB. Either mode will round down to the nearest
/// aligned size required by the control plane.
pub fn set_reservoir_size(
&self,
hardware: &sled_hardware::HardwareManager,
mode: ReservoirMode,
) -> Result<(), Error> {
let hardware_physical_ram_bytes = hardware.usable_physical_ram_bytes();
let req_bytes = match mode {
ReservoirMode::None => return Ok(()),
ReservoirMode::Size(mb) => {
let bytes = ByteCount::from_mebibytes_u32(mb).to_bytes();
if bytes > hardware_physical_ram_bytes {
return Err(Error::ReservoirConfig(format!(
"cannot specify a reservoir of {bytes} bytes when \
physical memory is {hardware_physical_ram_bytes} bytes",
)));
}
bytes
}
ReservoirMode::Percentage(percent) => {
if !matches!(percent, 1..=99) {
return Err(Error::ReservoirConfig(format!(
"VMM reservoir percentage of {} must be between 0 and \
100",
percent
)));
};
(hardware_physical_ram_bytes as f64 * (percent as f64 / 100.0))
.floor() as u64
}
};

let req_bytes_aligned = vmm_reservoir::align_reservoir_size(req_bytes);

if req_bytes_aligned == 0 {
warn!(
self.inner.log,
"Requested reservoir size of {} bytes < minimum aligned size \
of {} bytes",
req_bytes,
vmm_reservoir::RESERVOIR_SZ_ALIGN
);
return Ok(());
}

// The max ByteCount value is i64::MAX, which is ~8 million TiB.
// As this value is either a percentage of DRAM or a size in MiB
// represented as a u32, constructing this should always work.
let reservoir_size = ByteCount::try_from(req_bytes_aligned).unwrap();
if let ReservoirMode::Percentage(percent) = mode {
info!(
self.inner.log,
"{}% of {} physical ram = {} bytes)",
percent,
hardware_physical_ram_bytes,
req_bytes,
);
}
info!(
self.inner.log,
"Setting reservoir size to {reservoir_size} bytes"
);
vmm_reservoir::ReservoirControl::set(reservoir_size)?;

*self.inner.reservoir_size.lock().unwrap() = reservoir_size;

Ok(())
}

/// Returns the last-set size of the reservoir
pub fn reservoir_size(&self) -> ByteCount {
*self.inner.reservoir_size.lock().unwrap()
}

pub async fn ensure_registered(
&self,
instance_id: Uuid,
@@ -379,6 +288,11 @@
.map_err(|_| Error::FailedSendInstanceManagerClosed)?;
rx.await?
}

/// Returns the last-set size of the reservoir
pub fn reservoir_size(&self) -> ByteCount {
self.inner.vmm_reservoir_manager.reservoir_size()
}
}

// Most requests that can be sent to the "InstanceManagerRunner" task.
@@ -472,6 +386,7 @@ struct InstanceManagerRunner {
storage: StorageHandle,
zone_bundler: ZoneBundler,
zone_builder_factory: ZoneBuilderFactory,
vmm_reservoir_watch: watch::Receiver<Option<ByteCount>>,
}

impl InstanceManagerRunner {
@@ -630,6 +545,7 @@ impl InstanceManagerRunner {
storage: self.storage.clone(),
zone_bundler: self.zone_bundler.clone(),
zone_builder_factory: self.zone_builder_factory.clone(),
vmm_reservoir_watch: self.vmm_reservoir_watch.clone(),
};

let state = crate::instance::InstanceInitialState {
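The `vmm_reservoir` module that defines `VmmReservoirManagerHandle` is not among the files shown in this slice of the diff. Going only by the call sites above (`watcher()` feeding `InstanceManagerServices`, and `reservoir_size()` backing the getter), a minimal sketch of the handle might look like the following; the field names, the `spawn` constructor, and the use of plain MiB counts instead of `ByteCount` are assumptions for illustration, not the PR's implementation.

```rust
use std::thread;
use tokio::sync::watch;

/// Hypothetical sketch of the handle shape used by InstanceManager above.
/// The real type lives in sled-agent/src/vmm_reservoir.rs and publishes an
/// Option<ByteCount>; a plain MiB count is used here to stay self-contained.
#[derive(Clone)]
pub struct VmmReservoirManagerHandle {
    size_rx: watch::Receiver<Option<u64>>,
}

impl VmmReservoirManagerHandle {
    /// Spawn a background worker that performs the (slow) reservoir
    /// allocation and publishes the resulting size once it completes.
    pub fn spawn(target_mib: u64) -> Self {
        let (tx, size_rx) = watch::channel(None);
        thread::spawn(move || {
            // The real worker would call into bhyve's reservoir control
            // here; this stand-in just pretends the allocation succeeded.
            let _ = tx.send(Some(target_mib));
        });
        Self { size_rx }
    }

    /// Receiver that instance startup can `wait_for` on.
    pub fn watcher(&self) -> watch::Receiver<Option<u64>> {
        self.size_rx.clone()
    }

    /// Last-published reservoir size, or zero until allocation finishes.
    pub fn reservoir_size(&self) -> u64 {
        self.size_rx.borrow().unwrap_or(0)
    }
}
```

Pushing the slow allocation behind a handle like this is what lets sled-agent keep servicing requests while the reservoir is being set up, which is the point of moving the work into a background task.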
1 change: 1 addition & 0 deletions sled-agent/src/lib.rs
@@ -40,6 +40,7 @@ mod smf_helper;
mod storage_monitor;
mod swap_device;
mod updates;
mod vmm_reservoir;
mod zone_bundle;

#[cfg(test)]
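The lib.rs hunk simply registers the new module. To tie the pieces together, here is how the hypothetical handle sketched above could be exercised end to end; the real wiring lives in sled-agent's service setup and the real sizes come from sled hardware, neither of which appears in this change set.

```rust
#[tokio::main]
async fn main() {
    // Using the hypothetical VmmReservoirManagerHandle sketched earlier.
    // Pretend the target reservoir is 512 GiB, expressed in MiB.
    let manager = VmmReservoirManagerHandle::spawn(512 * 1024);

    // An instance-start path would gate on readiness like this.
    let mut watcher = manager.watcher();
    watcher
        .wait_for(|size| size.map_or(false, |mib| mib >= 8192))
        .await
        .expect("reservoir worker exited before publishing a size");

    // And the synchronous getter still works, now backed by the watch value.
    println!("last-set reservoir size: {} MiB", manager.reservoir_size());
}
```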