Incorrect allocatable volumes count in csinode for AWS vt1*/g4* instance types #2105
Comments
Hey @mpatlasov, thank you for raising this issue! We will add the count of accelerators for these instance types to node startup by the next release (as well as any other devices that we are missing). Really appreciate the detailed ramp-up and resources on this!

/assign @ElijahQuinones
@AndrewSirenko: GitHub didn't allow me to assign the following users: ElijahQuinones. Note that only kubernetes-sigs members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/priority important-soon
Hi @mpatlasov, the PR for GPUs not being factored in has already been merged, and the PR for accelerators is in review right now.

As for your observation:

> There must be other contributors (other than GPUs) because for vt1* instance types the actual number doesn't decrease monotonically

The VT1 instance type is special in that both the vt1.3xlarge and vt1.6xlarge have accelerators that take up two attachment slots each, while the vt1.24xlarge's accelerators do not take up any attachment slots at all. This is not well documented, and I have cut an internal documentation ticket to correct this. Please let me know if you have any further questions or concerns!
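Under that explanation, the observed limits line up arithmetically. A quick sketch (the accelerator counts per instance size are my assumption based on the AWS VT1 instance specs, not stated in this thread):

```shell
# vt1.3xlarge: 1 accelerator x 2 slots each -> 26 - 2
echo $((26 - 1*2))   # prints 24
# vt1.6xlarge: 2 accelerators x 2 slots each -> 26 - 4
echo $((26 - 2*2))   # prints 22
# vt1.24xlarge: accelerators consume no attachment slots
echo $((26 - 0))     # prints 26
```

This reproduces the observed <24, 22, 26> sequence exactly.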
/kind bug
What happened?
`kubectl get csinode <node-name> -o json | jq .spec.drivers`

says that `allocatable.count` is 26 for vt1* instance types and 25 for g4* ones, while the actual number of volumes that can be attached to the node is smaller:

| type | reported | actual |
| --- | --- | --- |
| g4dn.xlarge | 25 | 24 |
| g4ad.xlarge | 25 | 24 |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
There are many other g4* instance types mentioned here, but I verified the issue only for g4dn.xlarge and g4ad.xlarge. The reported number for vt1.24xlarge (26) is correct, while the numbers for the other vt1* types are not.
What you expected to happen?
`kubectl get csinode` must report the correct maximum number of volumes that can be attached.

How to reproduce it (as minimally and precisely as possible)?
Apply the following StatefulSet with 26 replicas:
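The original manifest was not captured in this page. A minimal sketch of such a StatefulSet (the name, image, node label, and `ebs-sc` StorageClass are illustrative assumptions, not from the report):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ebs-limit-test        # hypothetical name
spec:
  serviceName: ebs-limit-test
  replicas: 26                # one replica (and one volume) per allocatable slot reported
  selector:
    matchLabels:
      app: ebs-limit-test
  template:
    metadata:
      labels:
        app: ebs-limit-test
    spec:
      # Pin all pods to the node under test so every volume attaches there.
      nodeSelector:
        kubernetes.io/hostname: <node-name>
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-sc   # assumes an aws-ebs-csi-driver StorageClass exists
        resources:
          requests:
            storage: 1Gi
```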
After a while, some pods get stuck in "ContainerCreating" status because their volumes are stuck in attaching state and cannot be attached to the node. The error for a stuck pod looks like this:
Anything else we need to know?:
Official doc "Amazon EBS volume limits for Amazon EC2 instances" states clearly that GPU (or accelerators) must be counted:
getVolumesLimit(), however, doesn't take them into account. It starts from availableAttachments=28 for Nitro instances, then applies the following arithmetic:
e.g. 28 - 1 - 1 - 1 == 25 for g4ad.xlarge.
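To put numbers on that arithmetic (a sketch; what the three subtracted slots stand for on g4ad.xlarge is my reading of the instance specs, not spelled out in the driver code quoted above):

```shell
# Driver's current arithmetic for g4ad.xlarge: 28 shared Nitro slots
# minus three non-GPU consumers (e.g. root volume, ENI, instance store volume).
echo $((28 - 1 - 1 - 1))      # prints 25 -- the reported count
# Subtracting the single GPU as well matches the observed limit:
echo $((28 - 1 - 1 - 1 - 1))  # prints 24 -- the actual count
```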
There must be other contributors (other than GPUs), because for vt1* instance types the actual number doesn't decrease monotonically:

| type | reported | actual |
| --- | --- | --- |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
| vt1.24xlarge | 26 | 26 |

I.e., it's hard to explain <24, 22, 26> solely from number-of-accelerators considerations.
Environment

- Kubernetes version (use `kubectl version`):
- Driver version: compiled manually (by `docker build -t quay.io/rh_ee_mpatlaso/misc:aws-ebs-csi-drv-upstream -f Dockerfile .`) from the head of the master branch of https://github.com/kubernetes-sigs/aws-ebs-csi-driver