instance volume limits: workloads no longer attach ebs volumes #1163

Closed
aydosman opened this issue Feb 4, 2022 · 62 comments

Comments
@aydosman

aydosman commented Feb 4, 2022

/kind bug

What happened?
Workloads stop attaching EBS volumes after reaching the instance volume limit; the expected number of replicas for our requirement isn't met and pods are left in a Pending state.

Nodes have the appropriate limit set to 25, but the scheduler sends more than 25 pods with volumes to a node.

kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition

attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

How to reproduce it (as minimally and precisely as possible)?

Deploying the test below should be sufficient to simulate the problem.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale and the replica count is correct, with all volumes attached.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 10          
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Environment

  • Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Version: Helm Chart: v2.6.2 Driver v1.5.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 4, 2022
@ryanpxyz

Hello,

I believe we are seeing this too.

Warning  FailedAttachVolume  34s (x11 over 8m47s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-61b4bf2c-541f-4ef1-9f21-redacted" : rpc error: code = Internal desc = Could not attach volume "vol-redacted" to node "i-redacted": attachment of disk "vol-redacted" failed, expected device to be attached but was attaching 

...with:

# /bin/kubelet --version
Kubernetes v1.20.11-eks-f17b81

and:

k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.4.0

Thanks,

Phil.

@stevehipwell
Contributor

I suspect this is a race condition somewhere; my current thinking is that it's in the scheduler, but I haven't had a chance to look into it further.

@gnufied
Contributor

gnufied commented Feb 16, 2022

Is the CSI driver running with correctly defined limits? What does the CSINode object from the node report?
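
For reference, the limit the driver reports can be inspected directly on the CSINode object, for example:

# replace <node-name> with the affected node
kubectl get csinode <node-name> -o yaml
# the EBS CSI limit is reported under spec.drivers[] for ebs.csi.aws.com
# as allocatable.count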

@stevehipwell
Contributor

@gnufied the CSI driver looks to be doing everything correctly; AFAIK the only thing it needs to do is report the maximum number of PV attachments it can make. As reported above, if you add latency between pod scheduling events, the pods are sent to nodes with space for the PV mounts, which is why I suspect it's a scheduler issue.

@gnufied
Contributor

gnufied commented Feb 16, 2022

@stevehipwell No, that shouldn't happen. We start counting volumes against the limit before pods are even started on the node. I am still waiting on the output of the CSINode object from the problematic node.

@stevehipwell
Contributor

@gnufied I agree that the CSI driver is reporting correctly, which, combined with the fact that the 60s wait fixes the issue, makes me believe this is actually a race condition happening elsewhere.

@sultanovich

I am seeing the same problem in my environment:

Events:
  Type     Reason       Age   From                                   Message
  ----     ------       ----  ----                                   -------
  Normal   Scheduled    2m5s  default-scheduler                      Successfully assigned 2d3b9d81e0b0/master-0 to ip-10-3-109-222.ec2.internal
  Warning  FailedMount  2s    kubelet, ip-10-3-109-222.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[master], unattached volumes=[master backups filebeat-configuration default-token-vjscf]: timed out waiting for the condition

I haven't been able to confirm whether they are related, but I see this happening on the same node that has ENIs in the "attaching" state, and I see the following errors in /var/log/aws-routed-eni/ipamd.log:

{"level":"error","ts":"2022-02-01T19:45:34.410Z","caller":"ipamd/ipamd.go:805","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: error attaching ENI: attachENI: failed to attach ENI:AttachmentLimitExceeded: Interface count 9 exceeds the limit for c5.4xlarge\n\tstatus code: 400, request id: 836ce9b1-ec63-4935-a007-739e32f506cb"}

For reference, the c5.4xlarge instance type in AWS supports 8 ENIs.

We are testing whether we can limit it using the volume-attach-limit option, which is not set yet, but I would first like to understand why this happens and whether there is a way to avoid hardcoding that value.

Environment

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
[user@admin [~] ~]$ kubectl -n kube-system get deployment ebs-csi-controller -o wide -o yaml | grep "image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver"
        image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
[user@admin [~] ~]$
[root@i-6666919f09cc78046 ~]# /usr/bin/kubelet --version
Kubernetes v1.19.15-eks-9c63c4
[root@i-6666919f09cc78046 ~]#

@ryanpxyz

ryanpxyz commented Feb 17, 2022

Hello,

... update from our side:

Our first simple workaround, as we only first observed the problem yesterday (might help others who are stuck and looking for a 'quick fix'):

  1. Cordon the node that the pod is stuck in 'Init ...' on.
  2. Delete the pod.
  3. Verify that the pod starts successfully on an alternative node.
  4. If not, repeat the cordoning until the pod is successfully deployed.
  5. Uncordon all node(s) upon successful deployment.

Then, following a dive into the EBS CSI driver code, we passed the option '--volume-attach-limit=50' to the node driver. I haven't tested this explicitly yet, however.
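
If you deploy via the Helm chart, this should roughly correspond to the node.volumeAttachLimit value (a minimal, untested sketch):

# values.yaml for the aws-ebs-csi-driver Helm chart
node:
  volumeAttachLimit: 50  # rendered as --volume-attach-limit=50 on the node driver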

The problem to me seems to be a missing feedback loop between the 'node driver' and the scheduler.

The scheduler says, "Hey, there's a node that satisfies my scheduling criteria ... I'll schedule the workload to run there ..." and the node driver says, "OK, I have a workload but I've reached this '25 attached volumes' limit so I'm done here ...".

This is just my perhaps primitive view of the situation.

Thanks,

Phil.

PS: following a re-deployment of the EBS CSI node driver we are still seeing the attribute 'attachable-volumes-aws-ebs' set to 25 on a 'describe node':

[screenshot of 'kubectl describe node' output showing attachable-volumes-aws-ebs: 25]

... we weren't expecting this.

@stevehipwell
Contributor

@ryanpxyz looking at the code, I think the CSI driver just reports how many attachments it can make. Until the PR to make this dynamic is merged and released, this is a fixed value per instance type or arg. This means there are two related but distinct issues.

The first is the incorrect max value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (5 series only) with no arg will have a limit of 25, which is correct as long as you only have 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work, but ones with one get stuck.

The second problem, which is what this issue is tracking, is that even when the criteria for a correctly reported max are met, it is still possible for too many pods to be scheduled on a node.

@stevehipwell
Contributor

@sultanovich see my reply above about the attachment limits. I think there is a separate issue and PR for resolving your problem.

@aydosman
Author

@gnufied output from the workers while running the original tests with limits

Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:34:19 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>



Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:44:10 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

@sultanovich

@stevehipwell I have no doubts about the limits that can be attached. My question is about why this happens and how to solve it.

I have generated a new issue (#1174) since the volume-attach-limit argument has not worked for me either.

Perhaps the issue with the ENIs in the attaching state is due to another cause; what I mentioned is that after that error I begin to see the volume problems.

@stevehipwell
Contributor

@sultanovich I think I've explained pretty well everything I know about this issue. Let me reiterate that there are two bugs here: the first, related to nodes not being picked up as Nitro or not having 25 free attachment slots, is being addressed by #1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue there are now a number of other issues relating to the first scenario.

Perhaps the issue with the ENIs in the attaching state is due to another cause; what I mentioned is that after that error I begin to see the volume problems.

The current driver doesn't take any dynamic attachments into consideration; you get 25 if the node is detected as Nitro or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, has NVMe drives, or is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.

@gnufied
Contributor

gnufied commented Feb 21, 2022

@ryanpxyz you are looking in the wrong place for the attach limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.

As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210 . It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode, then that would be a k/k bug (and we are going to need one).

@gnufied
Contributor

gnufied commented Feb 21, 2022

@bertinatto - are you aware of a bug where, if many pods are scheduled at once to a node, the scheduler may not correctly count the volume limits?

@stevehipwell
Contributor

@gnufied it looks like it's the Filter function that is doing the work we're interested in. Unless only a single pod can be scheduled at a time, which is unlikely, this code doesn't appear to check for other in-flight requests and could easily result in over-provisioning volumes on a node.

I would expect to see something that locks a CSINode so only one calculation could run at a time, but I might be missing something here, as I'm not really familiar with this part of the codebase.

As an aside, would supporting Storage Capacity Tracking help limit the blast radius of this issue?

@sultanovich

@gnufied I tried setting the --volume-attach-limit argument in a test environment and it worked fine.
The only limitation I find is that it applies to the entire cluster; with nodes of different AWS instance types it could limit the number of volumes I can host, increasing infrastructure costs.

Do you have any idea how long it might take to modify this check so it uses the correct limits on all instance types?

@stevehipwell
Contributor

@sultanovich this issue isn't the same as #1174, please don't confuse them.

@stevehipwell
Contributor

@gnufied @bertinatto do you have any more thoughts on this? I doubt I've read the code correctly, so I would appreciate someone looking at the code I mentioned above to see if they can spot the same potential issue.

@stevehipwell
Contributor

On further testing, it looks like this has been fixed via an EKS platform version update (I suspect); I'd be interested if anyone knows what exactly was fixed.

@jrsdav

jrsdav commented Apr 20, 2022

@stevehipwell The EKS AMI changelog for the most recent v20220406 release had one interesting note that might be relevant:

The bootstrap script will auto-discover maxPods values when instanceType is missing in eni-max-pods.txt

@stevehipwell
Contributor

@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue wasn't ever about the correct max value being set for attachments; that's a separate issue with a fix coming in the next minor version. It was a scheduling issue that didn't make much sense.

@LeeHampton

We're experiencing this issue as well, except that from some of the discussion above it sounds like people think it's some kind of scheduling race condition. In our case, it seems like the volume attachments are never being properly counted. We have a node with 25 attachments, but the Allocated resources section under kubectl describe node shows zero attachments:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         35170m (73%)    38 (79%)
  memory                      130316Mi (68%)  131564Mi (68%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0

Any leads on what might be causing that to happen?

@gnufied
Contributor

gnufied commented Apr 29, 2022

Again, it looks like you are looking at the wrong object. CSI volume limits are counted via CSINode objects, so please check what value that is reporting.

@LeeHampton

@gnufied Ah, okay. Thank you. It looks like the "allocatables" are indeed being properly counted, which I guess puts us in the race condition boat:

k describe csinode  ip-172-20-60-87.ec2.internal


Name:               ip-172-20-60-87.ec2.internal
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Wed, 27 Apr 2022 05:12:27 -0400
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0f37978c6d1e25a52
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Events:               <none>

@LeeHampton

@gnufied, actually, is "Allocatables" just the total limit? How do I see what it thinks is currently allocated?

@Legion2

Legion2 commented May 7, 2022

We are using CSI volumes and in-tree volumes at the same time and see similar errors. Even if CSI volumes are counted correctly, there are also non-CSI volumes attached to the nodes, which results in the underlying node limit being exceeded. Is this situation addressed by any of the linked issues?

@dyasny

dyasny commented Nov 3, 2022

So within the context of my use case - when I try a lot of small workloads on an m6a Node which is otherwise capable of supporting hundreds of Pods I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume.

To make matters worse a large number of small Pods all requiring IP addresses increases the amount of ENI attachments on my Node which further lowers my available EBS attachments.

My use case exactly. Essentially, this is AWS forcing you to pay for more instances. I am currently working on two things:

  1. Drop VPC-CNI for some overlay-based setup; this should mitigate the ENI attachment limitation (yes, I am aware of the prefixes hack and it still doesn't cut it).
  2. Drop the usage of EBS for something self-managed and more suitable for the use case of many small pods with many small volumes attached.

@stevehipwell
Contributor

@sotiriougeorge not directly CSI related, but I'd suggest switching over to IP prefix mode, which should mean you only need a single ENI (or 2 for custom networking). Secondly, according to the Kubernetes documentation, 110 pods per node is the upper limit and a good rule of thumb; the original EKS limit is based on the maximum IPs per instance, which I can't see any real justification for once it passed the 110 value. Thirdly, Kubernetes isn't designed for primarily stateful workloads, and where they are used it's usually for a service with a high resource requirement, meaning that you don't need to bin-pack lots of pods onto the same node.

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

@sotiriougeorge

sotiriougeorge commented Nov 3, 2022

@stevehipwell the concept is that the platform I am working on (obviously backed by EKS at this point) offers its end users the option to deploy their own workloads, all of which ... or rather most of which are backed by EBS volumes.

So the stress-testing of the cluster aims to discover what would happen if the users "went ham" on the platform and what kind of restrictions should be put in place as far as workload deployments are concerned.

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

I'd say it's more of a "volumes per most Pods".

I thank you for your suggestions though; IP prefix mode is something I had seen but hadn't found the time to deep dive into and see how it would help me. Sometimes you can only absorb so much new info. I'm done hijacking this thread!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2023
@greenaar

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2023
@torredil torredil removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2023
@greenaar

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2023
@tristanaj

tristanaj commented Oct 11, 2023

Unless I'm mistaken, this still seems to be an issue in Kubernetes v1.28 (on EKS) with version v1.23.1 of the EBS CSI Driver. The following (albeit unrealistic) example reproduces the problem by trying to send 26 pods in a StatefulSet, with one PVC each, to the same node. I would hope the scheduler wouldn't do this, but instead on my nodes it gets to 24 pods and the 24th gets stuck in a Pending state, complaining that it can't attach the volume.

Is this likely to be fixed in an upcoming release? It's causing us major problems. Obviously we're not trying to send 26 pods with PVCs to the same node, but intermittently in our application the scheduler tries to schedule a pod with a PVC that won't attach because the attachment quota has been breached, causing downtime and instability. Is there any workaround for this? Thanks in advance.

apiVersion: v1
kind: Namespace
metadata:
  name: vols-test
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols-test
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 26
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      affinity:
        podAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
              matchExpressions:
               - key: app
                 operator: In
                 values:
                 - vols-pv-test
             topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ebs-sc" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

@tristanaj

/remove-lifecycle stale

@idanshaby

Happens to me as well.
So do we have a bug in the scheduler? It indeed looks like the driver reports whatever it needs to report.
Other than the workaround of setting the volumeAttachLimit parameter on the CSI add-on (which practically imposes the limitation of using a single instance type in the whole cluster), did anyone find a better workaround?

@torredil
Member

@idanshaby Take a look at the recently introduced additional DaemonSets feature: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/additional-daemonsets.md; this will enable you to specify different volume attachment limits per instance type.

@idanshaby

Looks interesting, @torredil. Thanks for that!
I guess it will take some time until the EKS add-on consumes it, but it's worth waiting.

@torredil
Member

@idanshaby The add-on schema has already been updated to include this parameter!

$ eksctl utils describe-addon-configuration --name aws-ebs-csi-driver --version v1.25.0-eksbuild.1 | yq

    "additionalDaemonSets": {
      "default": {},
      "description": "Additional DaemonSets of the node pod",
      "patternProperties": {
        "^.*$": {
          "$ref": "#/properties/node",
          "type": "object"
        }
      },
      "type": "object"
    },
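
Presumably the same keys can then be passed as add-on configuration values; an untested sketch (cluster name and add-on version are placeholders):

aws eks update-addon \
  --cluster-name <cluster-name> \
  --addon-name aws-ebs-csi-driver \
  --addon-version v1.25.0-eksbuild.1 \
  --configuration-values '{"node":{"volumeAttachLimit":25}}'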

@ckhelifi

ckhelifi commented Feb 1, 2024

Hi there,

I have a similar issue, and I presume that the ebs-csi-node calculates the instance volume attachment limit only at startup, but this limit can change if network interfaces are attached to the node later (which is what the Amazon VPC CNI plugin does).

Am I right?
The fact that the allocatable.count value changed when I restarted the ebs-csi-node pod on my node makes me think so:

kubectl get csinode ip-xx-xx-xxx-xx.eu-west-3.compute.internal -o yaml
---
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-01-31T20:17:17Z"
  name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
    uid: 44b1435c-8af5-455c-a8fa-5070512f623a
  resourceVersion: "3089141802"
  uid: 2eedb777-da67-40ba-8aee-e6d844f8ec37
spec:
  drivers:
  - name: efs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys: null
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
---
kubectl delete pod ebs-csi-node-w6gvq -n kube-system
pod "ebs-csi-node-w6gvq" deleted
---
kubectl get csinode ip-xx-xx-xxx-xx.eu-west-3.compute.internal -o yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-01-31T20:17:17Z"
  name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
    uid: 44b1435c-8af5-455c-a8fa-5070512f623a
  resourceVersion: "3090762035"
  uid: 2eedb777-da67-40ba-8aee-e6d844f8ec37
spec:
  drivers:
  - name: efs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys: null
  - allocatable:
      count: 22
    name: ebs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 31, 2024
@AndrewSirenko
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 6, 2024
@ConnorJC3
Contributor

/close

Hey everyone, we're going to go ahead and close out this issue, as it has become a mess of related "volume limit" issues. Below is the current status of volume limits in the EBS CSI Driver.

If you are experiencing a volume limit related bug on a supported version of the EBS CSI Driver that is not described below, please open a new issue for evaluation.

Driver Version

Everything below only applies if you're using a supported version of the driver - that is, one of the two most recent minor releases. Older versions of the driver do not receive bug fixes, contain known issues related to volume limits, and will not receive support for these issues. See the support policy for more information.

If you are on an older version of the driver and experiencing a volume limit related issue, the first step is to upgrade to the latest stable version to ensure your issue is not due to a bug that has already been fixed.

Volume Limit Calculation

Currently, the EBS CSI driver can receive metadata from two sources: IMDS or the Kubernetes API. Both sources provide the instance type and name, but only IMDS provides the number of ENIs and volumes attached to the instance.

Thus, if an instance has more than 1 ENI attached, or any non-CSI EBS volumes attached other than the root volume (such as an extra data volume for /var/lib/containerd/), the volume limit calculation will only be correct if using IMDS metadata.
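
As a sanity check outside the driver, the attachments actually present on an instance can be listed with the AWS CLI, for example (the instance ID is a placeholder):

# EBS volumes attached to the instance
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'Volumes[].Attachments[].Device'
# ENIs attached to the instance
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'NetworkInterfaces[].NetworkInterfaceId'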

We are also aware of an issue where the limit is incorrectly calculated on instances with GPUs or accelerators - that is being tracked in #2105 and we plan to fix soon.

Changing Volume Limit

Currently, Kubernetes only probes CSI drivers for the volume limit during startup. This means that if an additional slot is taken by something other than a CSI volume after startup (such as an additional ENI being attached to the node), the limit reported to Kubernetes is not updated to reflect it.

This is common when using the VPC CNI plugin, as it will sometimes attach additional ENIs if the number of available IPs reaches zero. This can be partially mitigated by using VPC CNI's prefix delegation feature.
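
For reference, prefix delegation is toggled via an environment variable on the VPC CNI's aws-node DaemonSet; a minimal example (check the VPC CNI documentation for your version):

# assign /28 prefixes instead of individual secondary IPs,
# reducing how many ENIs the CNI needs to attach
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true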

Solutions

We are looking into proposing ways the CSI spec and upstream components can be improved to better account for the dynamic nature of volume limits on EC2/EBS. In the meantime, we recommend adopting one or more of the below solutions if you are experiencing volume limit issues caused by the limit changing after startup:

Dedicated EBS Instance Types

Some gen7 and later EC2 instance types have a dedicated EBS volume limit. If you are experiencing issues due to slots being taken by non-volume attachments, such as ENIs allocated by the VPC CNI driver, using these instance types can remediate that issue because their volume slots are dedicated solely for EBS and will not be used by ENIs/GPUs/etc.

Volume Attach Limit Configuration

The EBS CSI Driver can be started with the CLI option --volume-attach-limit (Helm parameter node.volumeAttachLimit) to explicitly specify the limit for volumes to be reported to Kubernetes.

This parameter can be used in cases where you have a known safe limit.

Reserved Volume Attachments

The EBS CSI Driver can be started with the CLI option --reserved-volume-attachments (Helm parameter node.reservedVolumeAttachments) to reserve a number of slots for non-CSI volumes above what the driver detects is already in use on startup. These reserved slots will be subtracted from the total slots reported to Kubernetes.

This parameter can be used when the maximum number of slots that will be used by ENIs/non-CSI volumes/etc is known in advance.
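
A minimal Helm values sketch using the node.reservedVolumeAttachments parameter named above:

# values.yaml: keep 2 slots free for the root volume and any extra ENIs
node:
  reservedVolumeAttachments: 2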

Helm Additional DaemonSets

For clusters that need a mix of the above solutions, the Helm chart has the ability to construct multiple DaemonSets via the additionalDaemonSets parameter.

For example, the below Helm configuration would configure three different DaemonSets, each with their own specific attach limit:

node:
  nodeSelector:
    node.kubernetes.io/instance-type: c5.large
  volumeAttachLimit: 25

additionalNodeDaemonSets:
  big:
    nodeSelector:
      node.kubernetes.io/instance-type: m7i.48xlarge
    volumeAttachLimit: 100
  small:
    nodeSelector:
      node.kubernetes.io/instance-type: t3.medium
    volumeAttachLimit: 5

For more information about this feature, see the Additional DaemonSets docs

@k8s-ci-robot
Contributor

@ConnorJC3: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
