
Workflows of gpu‐provisioner

rambohe edited this page Jan 5, 2025 · 11 revisions

1. Launch NodeClaim

  • Execute the following command to create a NodeClaim: kubectl apply -f examples/v1-nodeclaim-gpu.yaml

2. Remove NodeClaim

  • Execute the following command to delete the specified NodeClaim: kubectl delete -f examples/v1-nodeclaim-gpu.yaml
  • If there is no related node for the NodeClaim, the NodeClaim finalizer is removed directly instead of blocking until the node is ready.
  • If a NodeClaim is deleted during node launch, the CloudProvider instance and the node will be leaked. A new controller, the instance garbage collection controller, was added to clean up these leaked resources.
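The finalizer rule above can be sketched as a small decision function. This is a simplified model, not gpu-provisioner's actual code: the `NodeClaim` dataclass, its `node_name` field, and the returned action strings are hypothetical names chosen for illustration, and the "terminate the node first" branch is an assumption based on the removal order described on this page.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class NodeClaim:
    name: str
    # Hypothetical field: name of the Node backing this claim, or None if no node was registered.
    node_name: Optional[str] = None


def deletion_action(claim: NodeClaim, existing_nodes: Set[str]) -> str:
    """Decide what to do when a NodeClaim is being deleted (illustrative model only)."""
    if claim.node_name is None or claim.node_name not in existing_nodes:
        # No related node: drop the finalizer right away instead of waiting for a node.
        return "remove-finalizer"
    # Assumption: when a related node exists, it is terminated before the finalizer is removed.
    return "terminate-node-first"
```

For a claim deleted before any node has registered, `deletion_action(NodeClaim("gpu-claim"), set())` takes the first branch, which is the case where a CloudProvider instance may be left behind for the instance garbage collection controller.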

3. Remove Node

  • Execute the following command to delete the specified Node: kubectl delete node {Node-Name}

4. Remove NodePool

  • Delete the specified NodePool on the AKS portal.

  • The initial expectation was that the NodeClaim would be leaked and that the nodeclaim.garbagecollection controller would garbage collect it. In AKS, however, this controller does not take effect.

  • When the backend NodePool is removed, the node in AKS is deleted. This triggers the [node termination controller], which in turn triggers the [nodeclaim termination controller]. As a result, no NodeClaims are leaked when backend NodePools are removed.

  • Combining these three workflows, the resource removal sequence in gpu-provisioner is: CloudProvider Instance --> Node --> NodeClaim
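The removal sequence can be illustrated with a toy model in which each resource kind is a set of names. This is only a sketch under stated assumptions: the function name `remove_backend_nodepool` is hypothetical, and it assumes a single name identifies the instance, the Node, and the NodeClaim, which may not match how the real controllers link them.

```python
from typing import List, Set


def remove_backend_nodepool(
    name: str, instances: Set[str], nodes: Set[str], nodeclaims: Set[str]
) -> List[str]:
    """Simulate the cascade and return the resource kinds in the order they are removed."""
    removed = []
    if name in instances:
        # The backend NodePool is removed, so the CloudProvider instance goes first.
        instances.remove(name)
        removed.append("CloudProvider Instance")
    if name in nodes and name not in instances:
        # With the instance gone, the Node is deleted (node termination step).
        nodes.remove(name)
        removed.append("Node")
    if name in nodeclaims and name not in nodes:
        # With the Node gone, the NodeClaim is deleted (nodeclaim termination step).
        nodeclaims.remove(name)
        removed.append("NodeClaim")
    return removed
```

Each step only runs once the previous resource is gone, which is why no NodeClaim is left behind in this model.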

5. CloudProvider Instance Garbage Collection

  • Execute the following command to create a NodeClaim: kubectl apply -f examples/v1-nodeclaim-gpu.yaml
  • Then, within 1 minute, execute the following command to delete the NodeClaim: kubectl delete -f examples/v1-nodeclaim-gpu.yaml
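Deleting the NodeClaim this early can leave a CloudProvider instance with no owning NodeClaim. A minimal sketch of how such leaked instances could be detected follows. The function name `find_leaked_instances` is hypothetical, and matching instances to NodeClaims by name is an assumption, since this page does not describe how the instance garbage collection controller actually performs the match.

```python
from typing import Iterable, List


def find_leaked_instances(
    instance_names: Iterable[str], nodeclaim_names: Iterable[str]
) -> List[str]:
    """Return instances that have no matching NodeClaim (illustrative model only)."""
    claimed = set(nodeclaim_names)
    # Assumption: an instance is owned by the NodeClaim that shares its name.
    return sorted(name for name in instance_names if name not in claimed)
```

A garbage collection loop would periodically compute this list and delete every instance in it.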

6. Node NotReady

  • Delete the specified node on the AKS portal. When the node becomes not ready, the NodeClaim status also becomes not ready.
  • If the node stays not ready for more than 10 minutes, the NodeClaim garbage collection controller deletes the related NodeClaim. In practice, however, the RP appears to delete the node in the AKS cluster before the NodeClaim garbage collection controller takes effect.
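The 10-minute rule can be expressed as a simple time comparison. This is a sketch rather than the controller's real implementation: `should_delete_nodeclaim` and `NOT_READY_TIMEOUT` are hypothetical names, and only the 10-minute threshold comes from the description above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the description above; the constant name is hypothetical.
NOT_READY_TIMEOUT = timedelta(minutes=10)


def should_delete_nodeclaim(
    not_ready_since: Optional[datetime],
    now: datetime,
    timeout: timedelta = NOT_READY_TIMEOUT,
) -> bool:
    """Return True once the node has been not ready for longer than the timeout."""
    if not_ready_since is None:
        # The node is ready, so there is nothing to garbage collect.
        return False
    return now - not_ready_since > timeout
```

In the scenario above this check would rarely be what removes the NodeClaim, because the node is already deleted in AKS before the timeout elapses.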