Workflows of gpu-provisioner
- Execute the command to create a NodeClaim: `kubectl apply -f examples/v1-nodeclaim-gpu.yaml`
- NodeClaim Lifecycle controller: https://github.com/Azure/gpu-provisioner/blob/main/vendor/sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/lifecycle/controller.go
- If the gpu-provisioner component is restarted during node launch, the launch is resumed once gpu-provisioner starts again (sketched below): https://github.com/rambohe-ch/gpu-provisioner/blob/c33136c46447449449f98d00b0ca15e73dacea28/pkg/providers/instance/instance.go#L105
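A minimal sketch of the idempotency that makes this resumption possible, assuming a hypothetical `InstanceAPI` interface and simplified types rather than the actual gpu-provisioner code (the real path is the `Create` flow in `pkg/providers/instance/instance.go` linked above): the launch first checks whether an instance already exists for the NodeClaim and only creates one if it does not, so re-running the launch after a restart picks up where it left off.

```go
package instance

import (
	"context"
	"errors"
	"fmt"
)

// Instance stands in for the cloud resource (e.g. an AKS agent pool / VM)
// backing a NodeClaim.
type Instance struct {
	Name string
}

// ErrNotFound signals that no instance exists yet for the NodeClaim.
var ErrNotFound = errors.New("instance not found")

// InstanceAPI is a hypothetical slice of the cloud-provider client surface,
// not the actual gpu-provisioner interface.
type InstanceAPI interface {
	Get(ctx context.Context, name string) (*Instance, error)
	Create(ctx context.Context, name string) (*Instance, error)
}

// launch is idempotent: it first looks for an instance that a previous,
// interrupted launch may already have created, and only creates a new one if
// nothing is found. Re-running it after a controller restart therefore
// resumes the in-flight launch instead of provisioning a duplicate.
func launch(ctx context.Context, api InstanceAPI, nodeClaimName string) (*Instance, error) {
	existing, err := api.Get(ctx, nodeClaimName)
	if err == nil {
		return existing, nil // resume: the instance already exists
	}
	if !errors.Is(err, ErrNotFound) {
		return nil, fmt.Errorf("checking for existing instance: %w", err)
	}
	return api.Create(ctx, nodeClaimName)
}
```

Because every step is safe to repeat, the controller does not need to persist any launch state across restarts; reconciling the NodeClaim again is enough.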
- Execute the command to delete the specified NodeClaim: `kubectl delete -f examples/v1-nodeclaim-gpu.yaml`
- If there is no related node for the NodeClaim, the NodeClaim finalizer is removed directly instead of blocking until the node is ready (see the sketch after this list).
- If a NodeClaim is deleted during node launch, the cloud provider instance and the node will be leaked. We added a new controller, the instance garbage collection controller, to clean up the leaked resources.
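The following is a simplified sketch of that finalizer decision, using stand-in types rather than the vendored karpenter termination controller; the finalizer name and the `NodeLister` helper are illustrative assumptions.

```go
package termination

import "context"

// NodeClaim is a reduced stand-in for the karpenter NodeClaim object.
type NodeClaim struct {
	Name       string
	ProviderID string
	Finalizers []string
}

// NodeLister is a hypothetical helper that reports whether a cluster Node
// is registered for the given provider ID.
type NodeLister interface {
	NodeExists(ctx context.Context, providerID string) (bool, error)
}

// terminationFinalizer is an illustrative name, not necessarily the one the
// vendored controller uses.
const terminationFinalizer = "karpenter.sh/termination"

// finalize decides how a deleted NodeClaim is released. If no node was ever
// registered for it, the finalizer is dropped immediately; otherwise the
// finalizer stays until node termination has completed, and the caller
// should requeue.
func finalize(ctx context.Context, nodes NodeLister, nc *NodeClaim) (requeue bool, err error) {
	exists, err := nodes.NodeExists(ctx, nc.ProviderID)
	if err != nil {
		return false, err
	}
	if !exists {
		removeFinalizer(nc, terminationFinalizer) // nothing to drain: release right away
		return false, nil
	}
	return true, nil // a node exists: wait for node termination first
}

// removeFinalizer filters the named finalizer out of the NodeClaim in place.
func removeFinalizer(nc *NodeClaim, name string) {
	kept := nc.Finalizers[:0]
	for _, f := range nc.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	nc.Finalizers = kept
}
```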
- Execute the command to delete the specified Node: `kubectl delete node {Node-Name}`
- Delete the specified NodePool on the AKS portal.
- Our expectation was that the NodeClaim would be leaked and that the nodeclaim.garbagecollection controller should garbage collect it; however, in AKS this controller does not take effect.
- When the backend NodePool is removed, the node in AKS is deleted, so the [node termination controller] is triggered, which in turn triggers the [nodeclaim termination controller]. As a result, no NodeClaims are leaked when backend NodePools are removed.
- Combining these three workflows, the resource removal sequence in gpu-provisioner is: CloudProvider Instance --> Node --> NodeClaim (a sketch of this ordering follows below).
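A small illustrative sketch of that ordering, with hypothetical teardown helpers standing in for the controllers that actually perform each step (instance garbage collection, node termination, nodeclaim termination):

```go
package teardown

import "context"

// Teardown bundles hypothetical delete helpers in the order gpu-provisioner
// releases resources; in the real system each step belongs to a different
// controller rather than a single call chain.
type Teardown interface {
	DeleteCloudInstance(ctx context.Context, providerID string) error
	DeleteNode(ctx context.Context, nodeName string) error
	ReleaseNodeClaim(ctx context.Context, nodeClaimName string) error
}

// removeAll walks the sequence CloudProvider Instance --> Node --> NodeClaim.
// A failure at any step leaves the later resources in place, so the NodeClaim
// finalizer is only dropped once the instance and the node are gone.
func removeAll(ctx context.Context, t Teardown, providerID, nodeName, nodeClaimName string) error {
	if err := t.DeleteCloudInstance(ctx, providerID); err != nil {
		return err
	}
	if err := t.DeleteNode(ctx, nodeName); err != nil {
		return err
	}
	return t.ReleaseNodeClaim(ctx, nodeClaimName)
}
```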
- Execute the command to create a NodeClaim: `kubectl apply -f examples/v1-nodeclaim-gpu.yaml`
- Then, within 1 minute, execute the command to delete the NodeClaim: `kubectl delete -f examples/v1-nodeclaim-gpu.yaml`
- Because the NodeClaim is deleted during node launch, the cloud provider instance and the node will be leaked.
- The instance garbage collection controller iterates over all cloud provider instances every 2 minutes and cleans up all leaked instances and nodes (see the sketch after this list).
- Code link: https://github.com/Azure/gpu-provisioner/blob/main/pkg/controllers/instance/garbagecollection/controller.go
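A rough sketch of what one garbage collection pass looks like, assuming hypothetical `CloudProvider` and `Cluster` interfaces (the authoritative logic is in the controller linked above): every instance that no live NodeClaim references is deleted, together with any node it registered.

```go
package garbagecollection

import (
	"context"
	"time"
)

// Instance is a reduced view of a cloud-provider instance created for a NodeClaim.
type Instance struct{ ProviderID string }

// CloudProvider is a hypothetical interface over the instances gpu-provisioner manages.
type CloudProvider interface {
	ListInstances(ctx context.Context) ([]Instance, error)
	DeleteInstance(ctx context.Context, providerID string) error
}

// Cluster is a hypothetical interface over the Kubernetes side: the set of
// provider IDs still referenced by NodeClaims, and node deletion.
type Cluster interface {
	NodeClaimProviderIDs(ctx context.Context) (map[string]bool, error)
	DeleteNodeByProviderID(ctx context.Context, providerID string) error
}

// pollInterval mirrors the 2-minute resync mentioned above.
const pollInterval = 2 * time.Minute

// collect deletes every instance (and the node it registered, if any) that no
// NodeClaim references anymore, which is exactly the state left behind when a
// NodeClaim is deleted mid-launch.
func collect(ctx context.Context, cloud CloudProvider, cluster Cluster) error {
	instances, err := cloud.ListInstances(ctx)
	if err != nil {
		return err
	}
	claimed, err := cluster.NodeClaimProviderIDs(ctx)
	if err != nil {
		return err
	}
	for _, inst := range instances {
		if claimed[inst.ProviderID] {
			continue // still owned by a live NodeClaim
		}
		if err := cloud.DeleteInstance(ctx, inst.ProviderID); err != nil {
			return err
		}
		if err := cluster.DeleteNodeByProviderID(ctx, inst.ProviderID); err != nil {
			return err
		}
	}
	return nil
}
```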
- Delete the specified node on the AKS portal. When the node becomes NotReady, the NodeClaim status also becomes not ready.
- If the node stays not ready for more than 10 minutes, the NodeClaim garbage collection controller deletes the related NodeClaim. In practice, however, it seems that the RP deletes the node in the AKS cluster before the NodeClaim garbage collection controller takes effect.
- Set the Ready condition to false: https://github.com/Azure/gpu-provisioner/blob/6899cbac6138e4c9480bad7ea880104e5d584525/vendor/sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/lifecycle/nodeready.go#L58
- Garbage collect the NodeClaim when its not-ready duration exceeds 10 minutes (see the sketch below): https://github.com/Azure/gpu-provisioner/blob/6899cbac6138e4c9480bad7ea880104e5d584525/vendor/sigs.k8s.io/karpenter/pkg/controllers/nodeclaim/garbagecollection/controller.go#L102
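A minimal sketch of the 10-minute not-ready check, using simplified condition types rather than the vendored karpenter objects; the `Deleter` hook and the condition handling are illustrative assumptions, not the linked controller's actual API.

```go
package garbagecollection

import (
	"context"
	"time"
)

// Condition is a reduced stand-in for a NodeClaim status condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	LastTransitionTime time.Time
}

// NodeClaim carries only what the sweep below needs.
type NodeClaim struct {
	Name       string
	Conditions []Condition
}

// Deleter is a hypothetical hook that issues the NodeClaim delete call.
type Deleter interface {
	DeleteNodeClaim(ctx context.Context, name string) error
}

// notReadyTTL mirrors the 10-minute threshold mentioned above.
const notReadyTTL = 10 * time.Minute

// sweep deletes every NodeClaim whose Ready condition has been False for
// longer than notReadyTTL, e.g. because its node was removed on the AKS side.
func sweep(ctx context.Context, now time.Time, claims []NodeClaim, d Deleter) error {
	for _, nc := range claims {
		for _, c := range nc.Conditions {
			if c.Type != "Ready" || c.Status != "False" {
				continue
			}
			if now.Sub(c.LastTransitionTime) > notReadyTTL {
				if err := d.DeleteNodeClaim(ctx, nc.Name); err != nil {
					return err
				}
			}
		}
	}
	return nil
}
```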