Clarify "affinity request" in policy doc comment #7649

david-crespo · 2025-02-26T17:32:12Z

These doc comments shows up in the docs site and I think it could be clearer for the end user. I also want to understand it well myself so we can get the copy right in the console and in the hand-written guides. My understanding from a quick look at the code is that "affinity request" here means something like "whenever we pick an instance for a sled", which as far as I can tell is the first time an instance is started, and possibly future starts if the sled reservation goes away on instance stop (I couldn't tell whether it does). I think "affinity request" could be misread by users to mean the request in which we add an instance to an affinity group, which is completely different from instance start time.

I am happy to write the copy and make the PR myself, I just need help getting clear on the facts. @smklein

omicron/common/src/api/external/mod.rs

Lines 1291 to 1304 in 665cf83

    
           /// Affinity policy used to describe "what to do when a request cannot be satisfied" 
        
           /// 
        
           /// Used for both Affinity and Anti-Affinity Groups 
        
           #[derive(Clone, Copy, Debug, Deserialize, Serialize, PartialEq, JsonSchema)] 
        
           #[serde(rename_all = "snake_case")] 
        
           pub enum AffinityPolicy { 
        
               /// If the affinity request cannot be satisfied, allow it anyway. 
        
               /// 
        
               /// This enables a "best-effort" attempt to satisfy the affinity policy. 
        
               Allow, 
        
               /// If the affinity request cannot be satisfied, fail explicitly. 
        
               Fail, 
        
           }

smklein · 2025-02-26T19:06:43Z

My understanding from a quick look at the code is that "affinity request" here means something like "whenever we pick an instance for a sled", which as far as I can tell is the first time an instance is started, and possibly future starts if the sled reservation goes away on instance stop (I couldn't tell whether it does).

This is largely correct, we call reserve_vmm_resources, which bottoms out in this function, during instance start. It also gets called during instance migration.

I think "affinity request" could be misread by users to mean the request in which we add an instance to an affinity group, which is completely different from instance start time.

Agreed, I welcome clearer language here. Maybe we could say "If the affinity constraints cannot be satisfied when placing an instance on a sled..."?

david-crespo · 2025-02-26T19:19:49Z

We don't necessarily have to say "when placing on a sled" in this spot — I think docs and console copy and observing the behavior of the system will be more load-bearing for getting the details across. How about being concrete about instance start being the relevant thing:

Allow instances to start even if affinity constraints cannot be satisfied.

Instance start will fail if affinity constraints cannot be satisfied.

Does the instance actually go to failed state, or does it stay in stopped?

smklein · 2025-02-26T19:27:01Z

Does the instance actually go to failed state, or does it stay in stopped?

reserve_vmm_resources is one of the first steps in the "instance start" saga. I think the state of the instance is "stopped", and remains that way, if we cannot fulfill the affinity request (with policy = fail).

omicron/nexus/src/app/sagas/instance_start.rs

Lines 155 to 175 in 71da009

    
           async fn sis_alloc_server( 
        
               sagactx: NexusActionContext, 
        
           ) -> Result<SledUuid, ActionError> { 
        
               let osagactx = sagactx.user_data(); 
        
               let params = sagactx.saga_params::<Params>()?; 
        
               let hardware_threads = params.db_instance.ncpus.0; 
        
               let reservoir_ram = params.db_instance.memory; 
        
               let propolis_id = sagactx.lookup::<PropolisUuid>("propolis_id")?; 
        
               let resource = super::instance_common::reserve_vmm_resources( 
        
                   osagactx.nexus(), 
        
                   InstanceUuid::from_untyped_uuid(params.db_instance.id()), 
        
                   propolis_id, 
        
                   u32::from(hardware_threads.0), 
        
                   reservoir_ram, 
        
                   db::model::SledReservationConstraints::none(), 
        
               ) 
        
               .await?; 
        
               Ok(resource.sled_id.into()) 
        
           }

IMO this also makes sense, re-trying the request might succeed, if other instances have been started/stopped, or hardware has changed.

david-crespo · 2025-02-26T19:33:14Z

We may want to say "Instance will not start" instead of some variation on fail to stay kind of vague while avoiding the expectation that it will land in failed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify "affinity request" in policy doc comment #7649

Clarify "affinity request" in policy doc comment #7649

david-crespo commented Feb 26, 2025 •

edited

Loading

smklein commented Feb 26, 2025

david-crespo commented Feb 26, 2025

smklein commented Feb 26, 2025

david-crespo commented Feb 26, 2025

Clarify "affinity request" in policy doc comment #7649

Clarify "affinity request" in policy doc comment #7649

Comments

david-crespo commented Feb 26, 2025 • edited Loading

smklein commented Feb 26, 2025

david-crespo commented Feb 26, 2025

smklein commented Feb 26, 2025

david-crespo commented Feb 26, 2025

david-crespo commented Feb 26, 2025 •

edited

Loading