instance volume limits: workloads no longer attach ebs volumes #1163

Closed
aydosman opened this issue Feb 4, 2022 · 62 comments

Comments
@aydosman

aydosman commented Feb 4, 2022

/kind bug

What happened?
Workloads stop attaching EBS volumes after reaching the instance volume limit; the expected number of replicas for our requirement isn't met and pods are left in a Pending state.

Nodes have the appropriate limit set to 25, but the scheduler sends more than 25 pods with volumes to a node.

kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition

attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

How to reproduce it (as minimally and precisely as possible)?

Deploying the test below should be sufficient to simulate the problem.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale and the replica count is correct, with all volumes attached.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 10          
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Environment

  • Kubernetes version: Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Version: Helm Chart: v2.6.2 Driver v1.5.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 4, 2022
@ryanpxyz

Hello,

I believe we are seeing this too.

Warning  FailedAttachVolume  34s (x11 over 8m47s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-61b4bf2c-541f-4ef1-9f21-redacted" : rpc error: code = Internal desc = Could not attach volume "vol-redacted" to node "i-redacted": attachment of disk "vol-redacted" failed, expected device to be attached but was attaching 

...with:

# /bin/kubelet --version
Kubernetes v1.20.11-eks-f17b81

and:

k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.4.0

Thanks,

Phil.

@stevehipwell
Contributor

I suspect this is a race condition somewhere; my current thinking is that it's in the scheduler, but I haven't had a chance to look into it further.

@gnufied
Contributor

gnufied commented Feb 16, 2022

Is the CSI driver running with correctly defined limits? What does the CSINode object from the node report?
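
For reference, the limit the driver reports can be inspected directly on the CSINode object, for example:

# replace <node-name> with the affected node
kubectl get csinode <node-name> -o yaml
# the EBS CSI limit is reported under spec.drivers[] for ebs.csi.aws.com
# as allocatable.count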

@stevehipwell
Contributor

@gnufied the CSI driver looks to be doing everything correctly; AFAIK the only thing it needs to do is report the maximum number of PV attachments it can make. As reported above, if you add latency between pod scheduling events, the pods are sent to nodes with space for the PV mounts, which is why I suspect it's a scheduler issue.

@gnufied
Contributor

gnufied commented Feb 16, 2022

@stevehipwell No, that shouldn't happen. We start counting volumes against the limit before pods are even started on the node. I am still waiting on the output of the CSINode object from the problematic node.

@stevehipwell
Contributor

@gnufied I agree that the CSI driver is reporting correctly, which, combined with the fact that the 60s wait fixes the issue, makes me believe this is actually a race condition happening elsewhere.

@sultanovich

I am seeing the same problem in my environment:

Events:
  Type     Reason       Age   From                                   Message
  ----     ------       ----  ----                                   -------
  Normal   Scheduled    2m5s  default-scheduler                      Successfully assigned 2d3b9d81e0b0/master-0 to ip-10-3-109-222.ec2.internal
  Warning  FailedMount  2s    kubelet, ip-10-3-109-222.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[master], unattached volumes=[master backups filebeat-configuration default-token-vjscf]: timed out waiting for the condition

I haven't been able to confirm whether they are related, but I see this happening on the same node that has ENIs in the "attaching" state, and I see the following errors in /var/log/aws-routed-eni/ipamd.log:

{"level":"error","ts":"2022-02-01T19:45:34.410Z","caller":"ipamd/ipamd.go:805","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: error attaching ENI: attachENI: failed to attach ENI:AttachmentLimitExceeded: Interface count 9 exceeds the limit for c5.4xlarge\n\tstatus code: 400, request id: 836ce9b1-ec63-4935-a007-739e32f506cb"}

For reference, the c5.4xlarge instance type in AWS supports 8 ENIs.

We are testing whether we can limit it using the volume-attach-limit option, which is not set yet, but I would first like to understand why this happens and whether there is a way to avoid hardcoding that value.

Environment

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
[user@admin [~] ~]$ kubectl -n kube-system get deployment ebs-csi-controller -o wide -o yaml | grep "image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver"
        image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
[user@admin [~] ~]$
[root@i-6666919f09cc78046 ~]# /usr/bin/kubelet --version
Kubernetes v1.19.15-eks-9c63c4
[root@i-6666919f09cc78046 ~]#

@ryanpxyz

ryanpxyz commented Feb 17, 2022

Hello,

... update from our side:

Our first simple workaround, as we only first observed the problem yesterday (might help others who are stuck and looking for a 'quick fix'):

  1. Cordon the node that the pod is stuck in 'Init ...' on.
  2. Delete the pod.
  3. Verify that the pod starts successfully on an alternative node.
  4. If not, repeat the cordoning until the pod is successfully deployed.
  5. Uncordon all node(s) upon successful deployment.

Then, following a dive into the EBS CSI driver code, we passed the option '--volume-attach-limit=50' to the node driver. I haven't tested this explicitly yet, however.
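
If you deploy via the Helm chart, this should roughly correspond to the node.volumeAttachLimit value (a minimal, untested sketch):

# values.yaml for the aws-ebs-csi-driver Helm chart
node:
  volumeAttachLimit: 50  # rendered as --volume-attach-limit=50 on the node driver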

The problem to me seems to be a missing feedback loop between the 'node driver' and the scheduler.

The scheduler says, "Hey, there's a node that satisfies my scheduling criteria ... I'll schedule the workload to run there ..." and the node driver says, "OK, I have a workload but I've reached this '25 attached volumes' limit so I'm done here ...".

This is just my perhaps primitive view of the situation.

Thanks,

Phil.

PS: following a re-deployment of the EBS CSI node driver we are still seeing the attribute 'attachable-volumes-aws-ebs' set to 25 on a 'describe node':

[screenshot of 'kubectl describe node' output showing attachable-volumes-aws-ebs: 25]

... we weren't expecting this.

@stevehipwell
Contributor

@ryanpxyz looking at the code, I think the CSI driver just reports how many attachments it can make. Until the PR to make this dynamic is merged and released, this is a fixed value per instance type or arg. This means there are two related but distinct issues.

The first is the incorrect max value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (5 series only) with no arg will have a limit of 25, which is correct as long as you only have 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work, but ones with one get stuck.

The second problem, which is what this issue is tracking, is that even when the criteria for a correctly reported max are met, it is still possible for too many pods to be scheduled on a node.

@stevehipwell
Contributor

@sultanovich see my reply above about the attachment limits. I think there is a separate issue and PR for resolving your problem.

@aydosman
Author

@gnufied output from the workers while running the original tests with limits

Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:34:19 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>



Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:44:10 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

@sultanovich

@stevehipwell I have no doubts about the limits that can be attached. My question is about why this happens and how to solve it.

I have generated a new issue (#1174) since the volume-attach-limit argument has not worked for me either.

Perhaps the issue with the ENIs in the attaching state is due to another cause; what I mentioned is that after that error I begin to see the volume problems.

@stevehipwell
Contributor

@sultanovich I think I've explained pretty well everything I know about this issue. Let me reiterate that there are two bugs here: the first, related to nodes not being picked up as Nitro or not having 25 free attachment slots, is being addressed by #1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue there are now a number of other issues relating to the first scenario.

Perhaps the issue with the ENIs in the attaching state is due to another cause; what I mentioned is that after that error I begin to see the volume problems.

The current driver doesn't take any dynamic attachments into consideration; you get 25 if the node is detected as Nitro or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, has NVMe drives, or is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.

@gnufied
Contributor

gnufied commented Feb 21, 2022

@ryanpxyz you are looking in the wrong place for the attach limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.

As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210 . It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode, then that would be a k/k bug (and we are going to need one).

@gnufied
Contributor

gnufied commented Feb 21, 2022

@bertinatto - are you aware of a bug where, if many pods are scheduled at once to a node, the scheduler may not correctly count the volume limits?

@stevehipwell
Contributor

@gnufied it looks like it's the Filter function that is doing the work we're interested in. Unless only a single pod can be scheduled at a time, which is unlikely, this code doesn't appear to check for other in-flight requests and could easily result in over-provisioning volumes on a node.

I would expect to see something that locks a CSINode so only one calculation could run at a time, but I might be missing something here, as I'm not really familiar with this part of the codebase.

As an aside, would supporting Storage Capacity Tracking help limit the blast radius of this issue?

@sultanovich

@gnufied I tried setting the --volume-attach-limit argument in a test environment and it worked fine.
The only limitation I find is that it applies to the entire cluster; with nodes of different AWS instance types it could limit the number of volumes I can host, increasing infrastructure costs.

Do you have any idea how long it might take to modify this check so it uses the correct limits on all instance types?

@stevehipwell
Contributor

@sultanovich this issue isn't the same as #1174, please don't confuse them.

@stevehipwell
Contributor

@gnufied @bertinatto do you have any more thoughts on this? I doubt I've read the code correctly, so I would appreciate someone looking at the code I mentioned above to see if they can spot the same potential issue.

@stevehipwell
Contributor

On further testing, it looks like this has been fixed via an EKS platform version update (I suspect); I'd be interested if anyone knows what exactly was fixed.

@jrsdav

jrsdav commented Apr 20, 2022

@stevehipwell The EKS AMI changelog for the most recent v20220406 release had one interesting note that might be relevant:

The bootstrap script will auto-discover maxPods values when instanceType is missing in eni-max-pods.txt

@stevehipwell
Contributor

@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue wasn't ever about the correct max value being set for attachments; that's a separate issue with a fix coming in the next minor version. It was a scheduling issue that didn't make much sense.

@LeeHampton

We're experiencing this issue as well, except that from some of the discussion above it sounds like people think it's some kind of scheduling race condition. In our case, it seems like the volume attachments are never being properly counted. We have a node with 25 attachments, but the Allocated resources section under kubectl describe node shows zero attachments:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         35170m (73%)    38 (79%)
  memory                      130316Mi (68%)  131564Mi (68%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0

Any leads on what might be causing that to happen?

@gnufied
Contributor

gnufied commented Apr 29, 2022

Again, it looks like you are looking at the wrong object. CSI volume limits are counted via CSINode objects, so please check what value that is reporting.

@LeeHampton

@gnufied Ah, okay. Thank you. It looks like the "allocatables" are indeed being properly counted, which I guess puts us in the race condition boat:

k describe csinode  ip-172-20-60-87.ec2.internal


Name:               ip-172-20-60-87.ec2.internal
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Wed, 27 Apr 2022 05:12:27 -0400
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0f37978c6d1e25a52
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Events:               <none>

@LeeHampton

@gnufied, actually, is "Allocatables" just the total limit? How do I see what it thinks is currently allocated?

@Legion2

Legion2 commented May 7, 2022

We are using CSI volumes and in-tree volumes at the same time and see similar errors. Even if CSI volumes are counted correctly, there are also non-CSI volumes attached to the nodes, which results in the underlying node limit being exceeded. Is this situation addressed by any of the linked issues?

@dyasny

dyasny commented Nov 3, 2022

So within the context of my use case - when I try a lot of small workloads on an m6a Node which is otherwise capable of supporting hundreds of Pods I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume.

To make matters worse a large number of small Pods all requiring IP addresses increases the amount of ENI attachments on my Node which further lowers my available EBS attachments.

My use case exactly. Essentially, this is AWS forcing you to pay for more instances. I am currently working on two things:

  1. Drop VPC-CNI for some overlay-based setup; this should mitigate the ENI attachment limitation (yes, I am aware of the prefixes hack and it still doesn't cut it).
  2. Drop the usage of EBS for something self-managed and more suitable for the use case of many small pods with many small volumes attached.

@stevehipwell
Contributor

@sotiriougeorge not directly CSI related, but I'd suggest switching over to IP prefix mode, which should mean you only need a single ENI (or 2 for custom networking). Secondly, according to the Kubernetes documentation, 110 pods per node is the upper limit and a good rule of thumb; the original EKS limit is based on the maximum IPs per instance, which I can't see any real justification for once it passed the 110 value. Thirdly, Kubernetes isn't designed for primarily stateful workloads, and where they are used it's usually for a service with a high resource requirement, meaning that you don't need to bin-pack lots of pods onto the same node.

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

@sotiriougeorge

sotiriougeorge commented Nov 3, 2022

@stevehipwell the concept is that the platform I am working on (obviously backed by EKS at this point) offers its end users the option to deploy their own workloads, all of which ... or rather most of which are backed by EBS volumes.

So the stress-testing of the cluster aims to discover what would happen if the users "went ham" on the platform and what kind of restrictions should be put in place as far as workload deployments are concerned.

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

I'd say it's more of a "volumes per most Pods".

I thank you for your suggestions though; IP prefix mode is something I had seen but hadn't found the time to deep dive into and see how it would help me. Sometimes you can only absorb so much new info. I'm done hijacking this thread!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2023
@greenaar

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 12, 2023
@torredil torredil removed the kind/bug Categorizes issue or PR as related to a bug. label Mar 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 20, 2023
@greenaar

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2023
@tristanaj

tristanaj commented Oct 11, 2023

Unless I'm mistaken, this still seems to be an issue in Kubernetes v1.28 (on EKS) with version v1.23.1 of the EBS CSI Driver. The following (albeit unrealistic) example reproduces the problem by trying to send 26 pods in a StatefulSet, with one PVC each, to the same node. I would hope the scheduler wouldn't do this, but instead on my nodes it gets to 24 pods and the 24th gets stuck in a Pending state, complaining that it can't attach the volume.

Is this likely to be fixed in an upcoming release? It's causing us major problems. Obviously we're not trying to send 26 pods with PVCs to the same node, but intermittently in our application the scheduler tries to schedule a pod with a PVC that won't attach because the attachment quota has been breached, causing downtime and instability. Is there any workaround for this? Thanks in advance.

apiVersion: v1
kind: Namespace
metadata:
  name: vols-test
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols-test
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 26
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      affinity:
        podAffinity:
           requiredDuringSchedulingIgnoredDuringExecution:
           - labelSelector:
              matchExpressions:
               - key: app
                 operator: In
                 values:
                 - vols-pv-test
             topologyKey: "kubernetes.io/hostname"
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "ebs-sc" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

@tristanaj

/remove-lifecycle stale

@idanshaby

Happens to me as well.
So do we have a bug in the scheduler? It indeed looks like the driver reports whatever it needs to report.
Other than the workaround of setting the volumeAttachLimit parameter on the CSI add-on (which practically imposes the limitation of using a single instance type in the whole cluster), did anyone find a better workaround?

@torredil
Member

@idanshaby Take a look at the recently introduced additional DaemonSets feature: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/additional-daemonsets.md; this will enable you to specify different volume attachment limits per instance type.

@idanshaby

Looks interesting, @torredil. Thanks for that!
I guess it will take some time until the EKS add-on consumes it, but it's worth waiting.

@torredil
Member

@idanshaby The add-on schema has already been updated to include this parameter!

$ eksctl utils describe-addon-configuration --name aws-ebs-csi-driver --version v1.25.0-eksbuild.1 | yq

    "additionalDaemonSets": {
      "default": {},
      "description": "Additional DaemonSets of the node pod",
      "patternProperties": {
        "^.*$": {
          "$ref": "#/properties/node",
          "type": "object"
        }
      },
      "type": "object"
    },
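
Presumably the same keys can then be passed as add-on configuration values; an untested sketch (cluster name and add-on version are placeholders):

aws eks update-addon \
  --cluster-name <cluster-name> \
  --addon-name aws-ebs-csi-driver \
  --addon-version v1.25.0-eksbuild.1 \
  --configuration-values '{"node":{"volumeAttachLimit":25}}'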

@ckhelifi

ckhelifi commented Feb 1, 2024

Hi there,

I have a similar issue, and I presume that the ebs-csi-node calculates the instance volume attachment limit only at startup, but this limit can change if network interfaces are attached to the node later (which is what the Amazon VPC CNI plugin does).

Am I right?
The fact that the allocatable.count value changed when I restarted the ebs-csi-node pod on my node makes me think so:

kubectl get csinode ip-xx-xx-xxx-xx.eu-west-3.compute.internal -o yaml
---
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-01-31T20:17:17Z"
  name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
    uid: 44b1435c-8af5-455c-a8fa-5070512f623a
  resourceVersion: "3089141802"
  uid: 2eedb777-da67-40ba-8aee-e6d844f8ec37
spec:
  drivers:
  - name: efs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys: null
  - allocatable:
      count: 26
    name: ebs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys:
    - topology.ebs.csi.aws.com/zone
---
kubectl delete pod ebs-csi-node-w6gvq -n kube-system
pod "ebs-csi-node-w6gvq" deleted
---
kubectl get csinode ip-xx-xx-xxx-xx.eu-west-3.compute.internal -o yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  annotations:
    storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/azure-file,kubernetes.io/cinder,kubernetes.io/gce-pd,kubernetes.io/vsphere-volume
  creationTimestamp: "2024-01-31T20:17:17Z"
  name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-xx-xx-xxx-xx.eu-west-3.compute.internal
    uid: 44b1435c-8af5-455c-a8fa-5070512f623a
  resourceVersion: "3090762035"
  uid: 2eedb777-da67-40ba-8aee-e6d844f8ec37
spec:
  drivers:
  - name: efs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys: null
  - allocatable:
      count: 22
    name: ebs.csi.aws.com
    nodeID: i-0168576e9bfa6f710
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 31, 2024
@AndrewSirenko
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 6, 2024
@ConnorJC3
Contributor

/close

Hey everyone, we're going to go ahead and close out this issue, as it has become a mess of related "volume limit" issues. Below is the current status of volume limits in the EBS CSI Driver.

If you are experiencing a volume limit related bug on a supported version of the EBS CSI Driver that is not described below, please open a new issue for evaluation.

Driver Version

Everything below only applies if you're using a supported version of the driver - that is, one of the two most recent minor releases. Older versions of the driver do not receive bug fixes, contain known issues related to volume limits, and will not receive support for these issues. See the support policy for more information.

If you are on an older version of the driver and experiencing a volume limit related issue, the first step is to upgrade to the latest stable version to ensure your issue is not due to a bug that has already been fixed.

Volume Limit Calculation

Currently, the EBS CSI driver can receive metadata from two sources: IMDS or the Kubernetes API. Both sources provide the instance type and name, but only IMDS provides the number of ENIs and volumes attached to the instance.

Thus, if an instance has more than 1 ENI attached, or any non-CSI EBS volumes attached other than the root volume (such as an extra data volume for /var/lib/containerd/), the volume limit calculation will only be correct if using IMDS metadata.
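
As a sanity check outside the driver, the attachments actually present on an instance can be listed with the AWS CLI, for example (the instance ID is a placeholder):

# EBS volumes attached to the instance
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'Volumes[].Attachments[].Device'
# ENIs attached to the instance
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=<instance-id> \
  --query 'NetworkInterfaces[].NetworkInterfaceId'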

We are also aware of an issue where the limit is incorrectly calculated on instances with GPUs or accelerators - that is being tracked in #2105 and we plan to fix soon.

Changing Volume Limit

Currently, Kubernetes only probes CSI drivers for the volume limit during startup. This means that if an additional slot is taken by something other than a CSI volume after startup (such as an additional ENI being attached to the node), the limit reported to Kubernetes is not updated to reflect it.

This is common when using the VPC CNI plugin, as it will sometimes attach additional ENIs if the number of available IPs reaches zero. This can be partially mitigated by using VPC CNI's prefix delegation feature.
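
For reference, prefix delegation is toggled via an environment variable on the VPC CNI's aws-node DaemonSet; a minimal example (check the VPC CNI documentation for your version):

# assign /28 prefixes instead of individual secondary IPs,
# reducing how many ENIs the CNI needs to attach
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true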

Solutions

We are looking into proposing ways the CSI spec and upstream components can be improved to better account for the dynamic nature of volume limits on EC2/EBS. In the meantime, we recommend adopting one or more of the below solutions if you are experiencing volume limit issues caused by the limit changing after startup:

Dedicated EBS Instance Types

Some gen7 and later EC2 instance types have a dedicated EBS volume limit. If you are experiencing issues due to slots being taken by non-volume attachments, such as ENIs allocated by the VPC CNI driver, using these instance types can remediate that issue because their volume slots are dedicated solely for EBS and will not be used by ENIs/GPUs/etc.

Volume Attach Limit Configuration

The EBS CSI Driver can be started with the CLI option --volume-attach-limit (Helm parameter node.volumeAttachLimit) to explicitly specify the limit for volumes to be reported to Kubernetes.

This parameter can be used in cases where you have a known safe limit.

Reserved Volume Attachments

The EBS CSI Driver can be started with the CLI option --reserved-volume-attachments (Helm parameter node.reservedVolumeAttachments) to reserve a number of slots for non-CSI volumes above what the driver detects is already in use on startup. These reserved slots will be subtracted from the total slots reported to Kubernetes.

This parameter can be used when the maximum number of slots that will be used by ENIs/non-CSI volumes/etc is known in advance.
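
A minimal Helm values sketch using the node.reservedVolumeAttachments parameter named above:

# values.yaml: keep 2 slots free for the root volume and any extra ENIs
node:
  reservedVolumeAttachments: 2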

Helm Additional DaemonSets

For clusters that need a mix of the above solutions, the Helm chart has the ability to construct multiple DaemonSets via the additionalDaemonSets parameter.

For example, the below Helm configuration would configure three different DaemonSets, each with their own specific attach limit:

node:
  nodeSelector:
    node.kubernetes.io/instance-type: c5.large
  volumeAttachLimit: 25

additionalNodeDaemonSets:
  big:
    nodeSelector:
      node.kubernetes.io/instance-type: m7i.48xlarge
    volumeAttachLimit: 100
  small:
    nodeSelector:
      node.kubernetes.io/instance-type: t3.medium
    volumeAttachLimit: 5

For more information about this feature, see the Additional DaemonSets docs

@k8s-ci-robot
Contributor

@ConnorJC3: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
