Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify "affinity request" in policy doc comment #7649

Open
david-crespo opened this issue Feb 26, 2025 · 4 comments
Open

Clarify "affinity request" in policy doc comment #7649

david-crespo opened this issue Feb 26, 2025 · 4 comments

Comments

@david-crespo
Copy link
Contributor

david-crespo commented Feb 26, 2025

These doc comments shows up in the docs site and I think it could be clearer for the end user. I also want to understand it well myself so we can get the copy right in the console and in the hand-written guides. My understanding from a quick look at the code is that "affinity request" here means something like "whenever we pick an instance for a sled", which as far as I can tell is the first time an instance is started, and possibly future starts if the sled reservation goes away on instance stop (I couldn't tell whether it does). I think "affinity request" could be misread by users to mean the request in which we add an instance to an affinity group, which is completely different from instance start time.

I am happy to write the copy and make the PR myself, I just need help getting clear on the facts. @smklein

Image

/// Affinity policy used to describe "what to do when a request cannot be satisfied"
///
/// Used for both Affinity and Anti-Affinity Groups
#[derive(Clone, Copy, Debug, Deserialize, Serialize, PartialEq, JsonSchema)]
#[serde(rename_all = "snake_case")]
pub enum AffinityPolicy {
/// If the affinity request cannot be satisfied, allow it anyway.
///
/// This enables a "best-effort" attempt to satisfy the affinity policy.
Allow,
/// If the affinity request cannot be satisfied, fail explicitly.
Fail,
}

@smklein
Copy link
Collaborator

smklein commented Feb 26, 2025

My understanding from a quick look at the code is that "affinity request" here means something like "whenever we pick an instance for a sled", which as far as I can tell is the first time an instance is started, and possibly future starts if the sled reservation goes away on instance stop (I couldn't tell whether it does).

This is largely correct, we call reserve_vmm_resources, which bottoms out in this function, during instance start. It also gets called during instance migration.

I think "affinity request" could be misread by users to mean the request in which we add an instance to an affinity group, which is completely different from instance start time.

Agreed, I welcome clearer language here. Maybe we could say "If the affinity constraints cannot be satisfied when placing an instance on a sled..."?

@david-crespo
Copy link
Contributor Author

We don't necessarily have to say "when placing on a sled" in this spot — I think docs and console copy and observing the behavior of the system will be more load-bearing for getting the details across. How about being concrete about instance start being the relevant thing:

Allow instances to start even if affinity constraints cannot be satisfied.

Instance start will fail if affinity constraints cannot be satisfied.

Does the instance actually go to failed state, or does it stay in stopped?

@smklein
Copy link
Collaborator

smklein commented Feb 26, 2025

Does the instance actually go to failed state, or does it stay in stopped?

reserve_vmm_resources is one of the first steps in the "instance start" saga. I think the state of the instance is "stopped", and remains that way, if we cannot fulfill the affinity request (with policy = fail).

async fn sis_alloc_server(
sagactx: NexusActionContext,
) -> Result<SledUuid, ActionError> {
let osagactx = sagactx.user_data();
let params = sagactx.saga_params::<Params>()?;
let hardware_threads = params.db_instance.ncpus.0;
let reservoir_ram = params.db_instance.memory;
let propolis_id = sagactx.lookup::<PropolisUuid>("propolis_id")?;
let resource = super::instance_common::reserve_vmm_resources(
osagactx.nexus(),
InstanceUuid::from_untyped_uuid(params.db_instance.id()),
propolis_id,
u32::from(hardware_threads.0),
reservoir_ram,
db::model::SledReservationConstraints::none(),
)
.await?;
Ok(resource.sled_id.into())
}

IMO this also makes sense, re-trying the request might succeed, if other instances have been started/stopped, or hardware has changed.

@david-crespo
Copy link
Contributor Author

We may want to say "Instance will not start" instead of some variation on fail to stay kind of vague while avoiding the expectation that it will land in failed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants