Karpenter stuck with pod Pod should schedule on: nodeclaim/... #432

Open
darklight147 opened this issue Jul 11, 2024 · 7 comments
@darklight147

Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.0.0

Expected Behavior

Create a new node

Actual Behavior

[image] Describing a Pod shows the message above (Pod should schedule on: nodeclaim/...), but no new Nodes are created.

Steps to Reproduce the Problem

AKS Cluster with node auto provisioning enabled

Scale a deployment (Nginx, for example) to 20 replicas with a memory request of 8Gi
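
For anyone trying to reproduce this, the steps above could be scripted roughly as follows. This is only a sketch, not the reporter's exact commands: the deployment name nginx, the nginx image, and the run/DRY_RUN helper are assumptions added for illustration.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Sketch of the reproduction steps (deployment name and image are assumptions).
reproduce() {
  # Create 20 replicas, then give each pod an 8Gi memory request.
  run kubectl create deployment nginx --image=nginx --replicas=20
  run kubectl set resources deployment nginx --requests=memory=8Gi
  # Pods that do not fit on existing nodes should show up as Pending.
  run kubectl get pods -l app=nginx --field-selector=status.phase=Pending
}
```

Setting DRY_RUN=1 before calling reproduce prints the three kubectl commands without touching the cluster.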

Resource Specs and Logs

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "12393960163388511505"
    karpenter.sh/nodepool-hash-version: v2
    kubernetes.io/description: General purpose NodePool for generic workloads
    meta.helm.sh/release-name: aks-managed-karpenter-overlay
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-07-11T00:34:44Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: karpenter-overlay-main-adapter-helmrelease
    helm.toolkit.fluxcd.io/namespace: 668f23a48709cf00012ccf73
  name: default
  resourceVersion: "485239"
  uid: 19ab7243-b704-4a1c-b9d8-279243f12865
spec:
  disruption:
    budgets:
    - nodes: 100%
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: karpenter.azure.com/sku-family
        operator: In
        values:
        - D
status:
  resources:
    cpu: "16"
    ephemeral-storage: 128G
    memory: 106086Mi
    pods: "110"

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Bryce-Soghigian
Collaborator

Bryce-Soghigian commented Jul 11, 2024

Looking at the cluster ID you shared, I see log entries such as: creating instance, insufficient capacity, regional on-demand vCPU quota limit for subscription has been reached. To scale beyond this limit, please review the quota increase process here: https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests

If you run kubectl get events | grep karp, do you see any events like this?
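
A slightly broader version of that check is sketched below. It assumes the relevant events may live outside the current namespace or may mention a NodeClaim without containing the string karp; the function name filter_karpenter_events is hypothetical.

```shell
# Hypothetical helper: keep only event lines that mention Karpenter objects.
# Reads `kubectl get events` output on stdin; prints nothing if no line matches.
filter_karpenter_events() {
  grep -i -E 'karp|nodeclaim|nodepool' || true
}

# Usage (needs cluster access):
#   kubectl get events --all-namespaces --sort-by=.lastTimestamp | filter_karpenter_events
```

An empty result here would only mean that no current event matches those strings; events expire after a while, so it does not rule out an earlier failure.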

@Bryce-Soghigian
Collaborator

Bryce-Soghigian commented Jul 11, 2024

You can unblock your scale-up by requesting additional quota on the subscription for that region, following the steps at https://learn.microsoft.com/en-us/azure/quotas/regional-quota-requests
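
Before requesting more quota, current usage can be compared against the limits from the command line. The sketch below assumes the Azure CLI is installed and logged in to the right subscription; the region eastus is a placeholder and the function name quota_exhausted is hypothetical.

```shell
# Hypothetical helper: print the names of quotas whose usage has reached the limit.
# Expects tab-separated "name<TAB>current<TAB>limit" lines on stdin.
# Rows with a limit of 0 are skipped.
quota_exhausted() {
  awk -F '\t' '($3 + 0) > 0 && ($2 + 0) >= ($3 + 0) { print $1 }'
}

# Usage ("eastus" is a placeholder region):
#   az vm list-usage --location eastus \
#     --query "[].[localName, currentValue, limit]" --output tsv | quota_exhausted
```

Any row printed is a quota that leaves no headroom in that region, such as a regional vCPU total or a per-family vCPU limit.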

@darklight147
Author

darklight147 commented Jul 12, 2024

@Bryce-Soghigian
The command returns no events:

image

Also, here is the current quota, sorted by current usage:

image

@darklight147
Author

@Bryce-Soghigian hey, any update on this? Thank you 🚀

@maulik13

We are seeing similar behaviour in our cluster. NodeClaims are created, but they never reach the Ready state. We do not see anything unusual in the events, only "Pod should schedule on: nodeclaim/app-g5gcw" and "Cannot disrupt NodeClaim" for existing nodes.

We have also checked that we have not reached our quota.

@maulik13

Ref: #438. Running az aks update -n cluster -g rg fixed the issue for us.

@maulik13

We are still seeing this behavior from time to time: NodeClaims are created and new VMs are provisioned, but the VMs do not manage to join the cluster and reach a Ready = true state. How do we debug this or provide you with logs? The only workaround so far is to reconcile the state by running an empty update against the cluster.
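
One way to collect details for a report like this is sketched below. It assumes NodeClaims expose a status condition of type Ready, as in the karpenter.sh/v1beta1 API used in this thread; the function name not_ready_nodeclaims is hypothetical.

```shell
# Hypothetical helper: list NodeClaims whose Ready condition is not "True".
# Expects "name<TAB>readyStatus" lines on stdin; an empty status counts as not ready.
not_ready_nodeclaims() {
  awk -F '\t' '$1 != "" && $2 != "True" { print $1 }'
}

# Usage (needs cluster access):
#   kubectl get nodeclaims -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | not_ready_nodeclaims
#
# Then inspect one of the reported claims, e.g. the one named earlier in the thread:
#   kubectl describe nodeclaim app-g5gcw
#
# Workaround mentioned above (empty update to reconcile the cluster):
#   az aks update -n cluster -g rg
```

The conditions and events shown by kubectl describe for a stuck NodeClaim are probably the most useful things to attach to the issue.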
