stateful deployments: use `TaskGroupVolumeClaim` table to associate volume requests with volume IDs #24993

pkazmierczak · 2025-01-31T18:33:17Z

This PR introduces an alternative to the solution presented in #24960. It is based on the state store rather than on previous-next allocation tracking in the reconciler. The new solution reduces the cognitive complexity of the scheduler code at the cost of slightly more boilerplate, and it opens up new possibilities for the future, e.g., allowing users to explicitly "un-stick" volumes while workloads are still running.

The diagram below illustrates the new logic:

     SetVolumes()                                               upsertAllocsImpl()          
     sets ns, job                             +-----------------checks if alloc requests    
     tg in the scheduler                      v                 sticky vols and consults    
            |                  +-----------------------+        state. If there is no claim,
            |                  | TaskGroupVolumeClaim: |        it creates one.             
            |                  | - namespace           |                                    
            |                  | - jobID               |                                    
            |                  | - tg name             |                                    
            |                  | - vol ID              |                                    
            v                  | uniquely identify vol |                                    
     hasVolumes()              +----+------------------+                                    
     consults the state             |           ^                                           
     and returns true               |           |               DeleteJobTxn()              
     if there's a match <-----------+           +---------------removes the claim from      
     or if there is no                                          the state                   
     previous claim                                                                         
|                             | |                                                      |    
+-----------------------------+ +------------------------------------------------------+    
                                                                                            
           scheduler                                  state store
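The flow in the diagram can be sketched in Go as follows. This is a minimal, in-memory illustration and not the code from this PR: `claimStore`, `upsertClaim`, `deleteJobClaims`, and `volumeFeasible` are hypothetical names, and the field names and types on `TaskGroupVolumeClaim` are assumed from the diagram. It also simplifies the upsert to "skip identical claims".

```go
package main

import "fmt"

// TaskGroupVolumeClaim carries the four fields listed in the diagram.
// Field names and types are assumptions made for this sketch.
type TaskGroupVolumeClaim struct {
	Namespace     string
	JobID         string
	TaskGroupName string
	VolumeID      string
}

// claimStore is a hypothetical in-memory stand-in for the state store table.
type claimStore struct {
	claims []TaskGroupVolumeClaim
}

// upsertClaim records a claim unless an identical one is already stored,
// loosely modelling the "if there is no claim, it creates one" step.
func (s *claimStore) upsertClaim(c TaskGroupVolumeClaim) {
	for _, existing := range s.claims {
		if existing == c {
			return
		}
	}
	s.claims = append(s.claims, c)
}

// deleteJobClaims drops every claim that belongs to the given job,
// modelling the job-deletion step in the diagram.
func (s *claimStore) deleteJobClaims(ns, jobID string) {
	kept := s.claims[:0]
	for _, c := range s.claims {
		if c.Namespace != ns || c.JobID != jobID {
			kept = append(kept, c)
		}
	}
	s.claims = kept
}

// volumeFeasible returns true if the volume matches a stored claim for the
// task group, or if the task group has no previous claim at all.
func (s *claimStore) volumeFeasible(ns, jobID, tg, volID string) bool {
	hasClaim := false
	for _, c := range s.claims {
		if c.Namespace != ns || c.JobID != jobID || c.TaskGroupName != tg {
			continue
		}
		hasClaim = true
		if c.VolumeID == volID {
			return true
		}
	}
	return !hasClaim
}

func main() {
	store := &claimStore{}
	claim := TaskGroupVolumeClaim{
		Namespace: "default", JobID: "db", TaskGroupName: "primary", VolumeID: "vol-1",
	}

	fmt.Println(store.volumeFeasible("default", "db", "primary", "vol-2")) // no claim stored yet
	store.upsertClaim(claim)
	fmt.Println(store.volumeFeasible("default", "db", "primary", "vol-2")) // task group is pinned to vol-1
	store.deleteJobClaims("default", "db")
	fmt.Println(store.volumeFeasible("default", "db", "primary", "vol-2")) // claim removed with the job
}
```

Keeping the claim in a table keyed by namespace, job, and task group is what lets the feasibility check stay a simple lookup instead of depending on which allocation preceded which.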

Supersedes #24960
Fixes issues found in #24869

nomad/state/state_store.go

nomad/fsm.go

tgross

This is looking great, @pkazmierczak!

scheduler/feasible.go

nomad/structs/structs.go

nomad/state/state_store.go

tgross

Approach looks good! One more item to clean up in the feasibility check, I think.

nomad/state/state_store_task_group_volume_claims.go

scheduler/feasible.go

nomad/state/state_store_task_group_volume_claims.go

tgross · 2025-02-06T18:58:45Z

scheduler/feasible.go

 	}
+
+	storedClaims, err := ctx.State().GetTaskGroupHostVolumeClaims(nil)


Somewhere along the way we dropped the requirement that this set of claims should contain only the ones for this task group. As written, this returns all claims for all task groups, most of which we don't care about here.
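One way to picture the scoping this comment asks for is the sketch below. It filters a slice in memory; `claim` and `claimsForTaskGroup` are hypothetical names for illustration and do not reflect the state store method that the PR ends up using.

```go
package main

import "fmt"

// claim is a hypothetical, simplified copy of the claim record.
type claim struct {
	Namespace, JobID, TaskGroupName, VolumeID string
}

// claimsForTaskGroup keeps only the claims that belong to one task group,
// so the feasibility check never looks at claims from unrelated jobs.
func claimsForTaskGroup(all []claim, ns, jobID, tg string) []claim {
	var out []claim
	for _, c := range all {
		if c.Namespace == ns && c.JobID == jobID && c.TaskGroupName == tg {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	all := []claim{
		{"default", "db", "primary", "vol-1"},
		{"default", "db", "replica", "vol-2"},
		{"default", "cache", "primary", "vol-3"},
	}
	for _, c := range claimsForTaskGroup(all, "default", "db", "primary") {
		fmt.Println(c.VolumeID)
	}
}
```

In a real state store the same narrowing would more likely come from an index lookup than from scanning every claim.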

scheduler/feasible.go

scheduler/feasible_test.go

tgross

Looks great! Just a couple small items left to address and I think this is ready to merge.

scheduler/feasible.go

scheduler/feasible_test.go

nomad/state/state_store_test.go

tgross

LGTM! 👍

pkazmierczak added 5 commits January 30, 2025 11:59

taskVolumeAssignmentSchema

47e8969

state store methods and struct definition

7f7fecd

wip

b0cc34f

clean up

22f5827

a few missing pieces

42a727a

tgross reviewed Jan 31, 2025

nomad/state/state_store.go (outdated, resolved)

nomad/fsm.go (outdated, resolved)

Tim's comments

727e564

vercel bot deployed to Preview – nomad-ui February 3, 2025 09:34 View deployment

pkazmierczak added 2 commits February 3, 2025 21:00

better schema?

3cb76b5

wip

088d525

vercel bot deployed to Preview – nomad-ui February 3, 2025 20:40 View deployment

pkazmierczak added 2 commits February 4, 2025 13:24

remove host volume IDs field from allocation

d0f5092

remove from api

0a11627

vercel bot deployed to Preview – nomad-ui February 4, 2025 12:29 View deployment

working prototype

554dda9

vercel bot deployed to Preview – nomad-ui February 4, 2025 13:03 View deployment

tgross reviewed Feb 4, 2025

scheduler/feasible.go (outdated, resolved)

nomad/structs/structs.go (outdated, resolved)

nomad/state/state_store.go (outdated, resolved)

nomad/state/state_store.go (outdated, resolved)

clean-ups and Tim's comments

10599dc

vercel bot deployed to Preview – nomad-ui February 4, 2025 18:20 View deployment

pkazmierczak added 2 commits February 5, 2025 10:29

feasibility test correction

06b3f98

basic test

0e7ae55

vercel bot deployed to Preview – nomad-ui February 5, 2025 10:18 View deployment

pkazmierczak requested review from schmichael, tgross and gulducat February 5, 2025 10:20

pkazmierczak self-assigned this Feb 5, 2025

pkazmierczak added the theme/scheduling label Feb 5, 2025

pkazmierczak added this to the 1.10.0 milestone Feb 5, 2025

pkazmierczak marked this pull request as ready for review February 5, 2025 10:20

pkazmierczak requested a review from a team as a code owner February 5, 2025 10:20

vercel bot deployed to Preview – nomad-ui February 5, 2025 16:07 View deployment

refactor the logic in hasVolumes

8a0cec1

vercel bot deployed to Preview – nomad-ui February 5, 2025 19:03 View deployment

simplify conditional

a516ba7

vercel bot deployed to Preview – nomad-ui February 5, 2025 19:08 View deployment

better test

4261a48

vercel bot deployed to Preview – nomad-ui February 6, 2025 15:35 View deployment

pkazmierczak mentioned this pull request Feb 6, 2025

scheduler: preserve allocations enriched during placement as 'informational' #24960

Closed

renamed to emphasize it's about host volumes

240d1dd

vercel bot deployed to Preview – nomad-ui February 6, 2025 15:44 View deployment

tgross reviewed Feb 6, 2025

nomad/state/state_store_task_group_volume_claims.go (outdated, resolved)

scheduler/feasible.go (outdated, resolved)

Tim's comment

6b68667

vercel bot deployed to Preview – nomad-ui February 6, 2025 17:21 View deployment

keep track of "used" claims

aa56c30

vercel bot deployed to Preview – nomad-ui February 6, 2025 17:44 View deployment

tgross reviewed Feb 6, 2025

remove TaskGroupHostVolumeClaimRegisterRequestType msg

408212c

vercel bot deployed to Preview – nomad-ui February 6, 2025 19:30 View deployment

pkazmierczak added 3 commits February 7, 2025 13:37

test for the upsert method

e90b923

update feasibility checker to only fetch task group claims

34c1bf5

fix feasibility test

34f139e

vercel bot deployed to Preview – nomad-ui February 7, 2025 12:51 View deployment

tgross reviewed Feb 7, 2025

scheduler/feasible.go (outdated, resolved)

scheduler/feasible.go (outdated, resolved)

scheduler/feasible_test.go (resolved)

nomad/state/state_store_test.go (outdated, resolved)

Tim's comments

eb29282

vercel bot deployed to Preview – nomad-ui February 7, 2025 16:20 View deployment

tgross approved these changes Feb 7, 2025

pkazmierczak merged commit 611452e into main Feb 7, 2025
30 checks passed

pkazmierczak deleted the f-stateful-deployments-volume-assignment-table branch February 7, 2025 16:41

pkazmierczak mentioned this pull request Feb 7, 2025

E2E: dynamic host volume tests for sticky volumes #24869

Merged

tgross mentioned this pull request Feb 7, 2025

E2E: dynamic host volumes sticky volumes drain test fix #25063

Merged
