Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stateful deployments: use TaskGroupVolumeClaim table to associate volume requests with volume IDs #24993

Merged
merged 28 commits into from
Feb 7, 2025

Conversation

pkazmierczak
Copy link
Contributor

@pkazmierczak pkazmierczak commented Jan 31, 2025

We introduce an alternative solution to the one presented in #24960 which is based on the state store and not previous-next allocation tracking in the reconciler. This new solution reduces cognitive complexity of the scheduler code at the cost of slightly more boilerplate code, but also opens up new possibilities in the future, e.g., allowing users to explicitly "un-stick" volumes with workloads still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store                           

Supersedes #24960
Fixes issues found in #24869

nomad/state/state_store.go Outdated Show resolved Hide resolved
nomad/fsm.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great, @pkazmierczak!

scheduler/feasible.go Outdated Show resolved Hide resolved
nomad/structs/structs.go Outdated Show resolved Hide resolved
nomad/state/state_store.go Outdated Show resolved Hide resolved
nomad/state/state_store.go Outdated Show resolved Hide resolved
@pkazmierczak pkazmierczak self-assigned this Feb 5, 2025
@pkazmierczak pkazmierczak added this to the 1.10.0 milestone Feb 5, 2025
@pkazmierczak pkazmierczak marked this pull request as ready for review February 5, 2025 10:20
@pkazmierczak pkazmierczak requested a review from a team as a code owner February 5, 2025 10:20
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach looks good! One more item to clean up in the feasibility check, I think.

nomad/state/state_store_task_group_volume_claims.go Outdated Show resolved Hide resolved
scheduler/feasible.go Outdated Show resolved Hide resolved
nomad/state/state_store_task_group_volume_claims.go Outdated Show resolved Hide resolved
}

storedClaims, err := ctx.State().GetTaskGroupHostVolumeClaims(nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere along the way we dropped that this set of claims should only be the ones for this task group. This would be all claims for all task groups, which we don't care about.

scheduler/feasible.go Outdated Show resolved Hide resolved
scheduler/feasible_test.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a couple small items left to address and I think this is ready to merge.

scheduler/feasible.go Outdated Show resolved Hide resolved
scheduler/feasible.go Outdated Show resolved Hide resolved
scheduler/feasible_test.go Show resolved Hide resolved
nomad/state/state_store_test.go Outdated Show resolved Hide resolved
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@pkazmierczak pkazmierczak merged commit 611452e into main Feb 7, 2025
30 checks passed
@pkazmierczak pkazmierczak deleted the f-stateful-deployments-volume-assignment-table branch February 7, 2025 16:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants