NodeClaims stuck deleting due to finalizer #525

Closed
flickers opened this issue Oct 15, 2024 · 2 comments · Fixed by #633
Labels
  • area/nodeclaim: Issues or PRs related to NodeClaim lifecycle management
  • area/spot: Issues or PRs related to spot
  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@flickers

Version

Karpenter Version: v0.5.4

Kubernetes Version: v1.30.3

Expected Behavior

When a node is deleted, the corresponding NodeClaim should eventually be deleted as well.

Actual Behavior

After running Karpenter for a while in 3 separate clusters in 3 separate subscriptions, we have seen Karpenter fail to delete NodeClaims:

  • The Kubernetes node has been deleted
  • The Azure resources have been deleted (NIC, disk and VM)
  • The Karpenter NodeClaim for the node still exists
  • The NodeClaim has .metadata.deletionTimestamp set, so deletion has been requested, but a finalizer is preventing it from being removed

As a workaround, we have been running the following command, which removes the finalizers from every NodeClaim that has .metadata.deletionTimestamp set:

kubectl get nodeclaim -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name' | xargs -I {} kubectl patch nodeclaim {} --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'
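
Before stripping finalizers, it can help to list only the NodeClaims that are actually stuck. The Go sketch below applies the same selection as the jq filter above (deletionTimestamp set), and additionally requires at least one remaining finalizer. It is a minimal illustration, not part of Karpenter: the function name stuckNodeClaims and the sample document in main are assumptions made up for this example, and the struct decodes only the metadata fields the check needs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// nodeClaimList mirrors only the fields of `kubectl get nodeclaim -o json`
// that this check needs; everything else in the output is ignored.
type nodeClaimList struct {
	Items []struct {
		Metadata struct {
			Name              string   `json:"name"`
			DeletionTimestamp *string  `json:"deletionTimestamp"`
			Finalizers        []string `json:"finalizers"`
		} `json:"metadata"`
	} `json:"items"`
}

// stuckNodeClaims (hypothetical helper) returns the names of NodeClaims that
// are marked for deletion but still carry at least one finalizer.
func stuckNodeClaims(raw []byte) ([]string, error) {
	var list nodeClaimList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, fmt.Errorf("decoding NodeClaim list: %w", err)
	}
	var names []string
	for _, item := range list.Items {
		m := item.Metadata
		if m.DeletionTimestamp != nil && len(m.Finalizers) > 0 {
			names = append(names, m.Name)
		}
	}
	return names, nil
}

func main() {
	// Sample data for illustration only; in practice this would be the
	// output of `kubectl get nodeclaim -o json`.
	sample := []byte(`{"items":[
	  {"metadata":{"name":"aks-prd-nodepool-kgcpq",
	               "deletionTimestamp":"2024-10-15T12:49:39Z",
	               "finalizers":["karpenter.sh/termination"]}},
	  {"metadata":{"name":"healthy-nodeclaim",
	               "finalizers":["karpenter.sh/termination"]}}
	]}`)

	names, err := stuckNodeClaims(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

Note that a NodeClaim in this state is not necessarily broken: it also looks like this briefly during a normal termination, so the age of the deletionTimestamp is worth checking before patching anything.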

Steps to Reproduce the Problem

Not sure how to reproduce. We are also running Karpenter in AWS, and have been since early on, and have never seen anything like this there.
We saw a PR for something similar in the kubernetes-sigs/karpenter repo, but that might be specific to Karpenter v1:
kubernetes-sigs/karpenter#1705

Resource Specs and Logs

{
    "level": "ERROR",
    "time": "2024-10-15T12:49:39.023Z",
    "logger": "controller.nodeclaim.termination",
    "message": "virtualMachine.Delete for aks-aks-prd-nodepool-kgcpq failed: GET https://management.azure.com/subscriptions/xxxxxxx-5535-4267-8839-xxxxxxxxx/resourceGroups/rg-aks-prd-nodes/providers/Microsoft.Compute/virtualMachines/aks-aks-prd-nodepool-kgcpq\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"NotFound\",\n    \"message\": \"The entity was not found in this Azure location.\"\n  }\n}\n--------------------------------------------------------------------------------\n",
    "commit": "846ef96",
    "nodeclaim": "aks-prd-nodepool-kgcpq",
    "node": "aks-aks-prd-nodepool-kgcpq",
    "provider-id": "azure:///subscriptions/xxxxxxx-5535-4267-8839-xxxxxxxxx/resourceGroups/rg-aks-prd-nodes/providers/Microsoft.Compute/virtualMachines/aks-aks-prd-nodepool-kgcpq",
    "nodeclaim": "aks-prd-nodepool-kgcpq"
},
{
    "level": "ERROR",
    "time": "2024-10-15T12:49:39.111Z",
    "logger": "controller",
    "message": "Reconciler error",
    "commit": "846ef96",
    "controller": "nodeclaim.termination",
    "controllerGroup": "karpenter.sh",
    "controllerKind": "NodeClaim",
    "NodeClaim": {
        "name": "aks-prd-nodepool-kgcpq"
    },
    "namespace": "",
    "name": "aks-prd-nodepool-kgcpq",
    "reconcileID": "2d7a5f01-0b83-47fc-b4ee-2bbbf2a16e8b",
    "error": "terminating cloudprovider instance, GET https://management.azure.com/subscriptions/xxxxxxx-5535-4267-8839-xxxxxxxxx/resourceGroups/rg-aks-prd-nodes/providers/Microsoft.Compute/virtualMachines/aks-aks-prd-nodepool-kgcpq\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n  \"error\": {\n    \"code\": \"NotFound\",\n    \"message\": \"The entity was not found in this Azure location.\"\n  }\n}\n--------------------------------------------------------------------------------\n"
}

@iamsaurabhgupt

iamsaurabhgupt commented Oct 17, 2024

We are also facing the same issue and have had to redeploy our Karpenter-based cluster multiple times.
It works initially, but starts failing after 2-3 days.

Checking the Karpenter logs, we find that it gets stuck on an error like:
{"level":"ERROR","time":"2024-10-17T21:42:10.740Z","logger":"controller","message":"Reconciler error","commit":"8xxxx6","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"spot-pool-dxxfh"},"namespace":"","name":"spot-pool-dxxfh","reconcileID":"bxxxxxxx-xxxxxxx-xxca","error":"terminating cloudprovider instance, GET https://management.azure.com/subscriptions/xxxxxx/resourceGroups/MC_xxxx_eastus/providers/Microsoft.Compute/virtualMachines/aks-spot-pool-dxxfh\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 Not Found\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n \"error\": {\n \"code\": \"NotFound\",\n \"message\": \"The entity was not found in this Azure location.\"\n }\n}\n--------------------------------------------------------------------------------\n"}

tallaxes added the needs-triage, area/spot and area/nodeclaim labels on Oct 22, 2024
@tallaxes
Collaborator

This is a bug; we should be returning NewNodeClaimNotFoundError from cloudprovider.Delete if the VM is no longer there.
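
To illustrate the idea, here is a minimal, self-contained Go sketch of translating a 404 from the VM delete call into a "not found" error that the termination controller can treat as "already gone" rather than as a failure to retry. It is not the provider's actual code and not the eventual fix: responseError, nodeClaimNotFoundError, deleteVM, deleteInstance and isNodeClaimNotFound are all local stand-ins invented for this example. The only assumptions taken from the thread are that the Azure error carries an HTTP status code (the logs show "RESPONSE 404") and that Karpenter has a NewNodeClaimNotFoundError constructor.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// responseError is a stand-in for the Azure SDK's HTTP error type.
// Assumption: the real error exposes the HTTP status code.
type responseError struct {
	StatusCode int
}

func (e *responseError) Error() string {
	return fmt.Sprintf("RESPONSE %d: %s", e.StatusCode, http.StatusText(e.StatusCode))
}

// nodeClaimNotFoundError is a stand-in for the error Karpenter creates
// with NewNodeClaimNotFoundError.
type nodeClaimNotFoundError struct{ err error }

func (e *nodeClaimNotFoundError) Error() string { return "nodeclaim not found, " + e.err.Error() }
func (e *nodeClaimNotFoundError) Unwrap() error { return e.err }

// deleteVM is a hypothetical signature for the call that deletes the VM.
type deleteVM func(vmName string) error

// deleteInstance deletes the VM and maps "VM does not exist" (HTTP 404)
// to a not-found error. Any other failure is returned as a regular error
// so the caller keeps retrying.
func deleteInstance(del deleteVM, vmName string) error {
	err := del(vmName)
	if err == nil {
		return nil
	}
	var respErr *responseError
	if errors.As(err, &respErr) && respErr.StatusCode == http.StatusNotFound {
		return &nodeClaimNotFoundError{err: err}
	}
	return fmt.Errorf("deleting VM %s: %w", vmName, err)
}

// isNodeClaimNotFound reports whether err is (or wraps) a not-found error.
func isNodeClaimNotFound(err error) bool {
	var nf *nodeClaimNotFoundError
	return errors.As(err, &nf)
}

func main() {
	// Simulate the situation from the logs: the VM is already gone.
	alreadyGone := func(string) error {
		return &responseError{StatusCode: http.StatusNotFound}
	}
	err := deleteInstance(alreadyGone, "aks-aks-prd-nodepool-kgcpq")
	if isNodeClaimNotFound(err) {
		fmt.Println("VM already deleted; finalizer can be removed")
	}
}
```

With this mapping in place, a caller that checks for the not-found case can release the finalizer instead of looping on "Reconciler error" as in the logs above, while transient failures (for example a 500) still surface as ordinary errors.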

tallaxes added the kind/bug and triage/accepted labels and removed the needs-triage label on Oct 22, 2024
tallaxes self-assigned this on Nov 7, 2024