Pods stuck terminating with migrated volumes #679
Comments
When you say new pods work fine, do you mean pods whose PVs explicitly specify CSI, i.e. PVs that say "spec.csi" and are not going through migration? And the ones that don't work are the PVs that say "spec.awsebs" and are going through migration? Could you share the PV volume spec (you can redact region/volume ID info if you wish)? IIRC there are a bunch of accepted formats like aws://vol-id, and I want to check which format you are using so I can use the same one to reproduce the issue. It's just a suspicion I have because the code derives the mount paths and such using the volume spec, like this one
Besides that, just looking at the logs |
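To make the point about volume ID formats concrete, here is a minimal sketch of normalizing an in-tree EBS volume ID to the bare `vol-...` form. The three shapes it accepts (`aws://<zone>/vol-...`, `aws:///vol-...`, and plain `vol-...`) are an assumption about which formats are in play, not a statement of what kubelet or the driver actually accepts; `normalize_ebs_volume_id` and the sample IDs are hypothetical.

```python
import re

# Assumed shapes (zone and ID values in the examples are made up):
#   aws://us-east-1a/vol-0123456789abcdef0
#   aws:///vol-0123456789abcdef0
#   vol-0123456789abcdef0
_VOLUME_ID = re.compile(r"^(?:aws://[^/]*/)?(vol-[0-9a-fA-F]+)$")


def normalize_ebs_volume_id(volume_id: str) -> str:
    """Return the bare vol-... ID, or raise ValueError for an unknown shape."""
    match = _VOLUME_ID.match(volume_id.strip())
    if match is None:
        raise ValueError(f"unrecognized EBS volume ID format: {volume_id!r}")
    return match.group(1)
```

Running the value from a PV spec through something like this would show quickly whether two PVs that look different actually point at the same volume ID.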
Correct, new pods that are using a PVC created from a CSI storageclass that have not gone through migration can be created/deleted fine.
The affected pod is from a StatefulSet, and when deleted it hangs in Terminating status
|
@appprod0 did you drain the node before applying the CSIMigrationAWS feature-gate and restarting kubelet? We had to do that for the migration flag to work; otherwise we get the exact same error as you. I don't think you can just turn on the feature-gate, delete the existing pod(s), and expect it to work. Hence why your new node from your ASG works. |
@wongma7 , I think this is reproducible using the following test:
Taking the following steps:
I've also confirmed that this is the same path that's on the node:
In this case we're seeing the path as
So here it's reported as awsebs in the PV's spec, but it's provisioned by the CSI driver. Here's an excerpt from the external-provisioner:
This failure is consistently reproducible when using the aforementioned test. I've added it here, as it appears to be very similar (if not the same). Would you have any thoughts on this one? |
In the original issue the paths were different, and it looks like that can happen if the feature gate is enabled without draining. In this one the paths are the same, and actually I think the path " /var/lib/kubelet/pods/bcc53e0c-3c2d-453a-aa13-7e81f67c1c6e/volumes/kubernetes.io~csi/pvc-8a72162b-db8b-4499-b9e4-623fe22241f5/moun" looks correct: even in the migration scenario the path should be csi. However, I'm not that familiar with this test case. What is responsible for unmounting the path if kubelet is down? |
Just looking through the test I don't see any explicit unmounts. It seems that the primary unmounting is expected in |
OK, I checked that test: it stops kubelet, deletes the pod, then starts kubelet. I misunderstood; I thought there was an expectation that the volume gets unmounted while kubelet is down, but it's testing that the volume gets unmounted after kubelet comes back up. So yeah, if we are failing that, it's a real bug. I think it has to do with kubelet's volume reconstruction logic (i.e. reconstructing its cache of which volumes are mounted from the paths of volumes on disk), so I will have to repro and check kubelet logs to be sure what's going on. |
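As a rough illustration of what "reconstruct from the paths on disk" means, the sketch below pulls the pod UID, plugin name, and volume name back out of a pod volume directory path. It is not kubelet's code; it only assumes the layout `pods/<pod UID>/volumes/<escaped plugin>/<volume name>` seen in this thread, with `/` in the plugin name escaped as `~`. The function name and the UID/PVC values in the usage line are hypothetical.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PodVolumePath(NamedTuple):
    pod_uid: str
    plugin_name: str   # unescaped, e.g. "kubernetes.io/csi"
    volume_name: str


def parse_pod_volume_path(path: str) -> PodVolumePath:
    """Split .../pods/<uid>/volumes/<escaped plugin>/<volume>[/mount]."""
    parts = PurePosixPath(path).parts
    try:
        start = parts.index("pods")
        pod_uid, volumes_dir, plugin_dir, volume_name = parts[start + 1:start + 5]
    except ValueError:
        # Raised both when "pods" is missing and when the path is too short.
        raise ValueError(f"not a pod volume path: {path!r}") from None
    if volumes_dir != "volumes":
        raise ValueError(f"not a pod volume path: {path!r}")
    # Assumption: the on-disk directory name replaces "/" with "~".
    return PodVolumePath(pod_uid, plugin_dir.replace("~", "/"), volume_name)
```

For example, `parse_pod_volume_path("/var/lib/kubelet/pods/1111-aaaa/volumes/kubernetes.io~csi/pvc-example/mount").plugin_name` gives `kubernetes.io/csi`, while a directory named `kubernetes.io~aws-ebs` would come back as `kubernetes.io/aws-ebs`. Comparing that value with the plugin kubelet expects for the volume is one way to spot the "paths were different" case.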
It started happening to me after cluster upgrade from 1.19.x to 1.20.x |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
I believe I'm seeing the same issue after upgrading from 1.19.6 to 1.20.9 too. Granted, I do not believe the volumes in question have been migrated. As far as I can tell, the kubelet asks the CSI driver to unmount the volume, and it logs that it does, but the volume is still mounted. If I manually unmount the volume on the Linux host, kubelet will finish terminating the pod; or if I restart kubelet, it will ask the CSI driver to unmount the volume again, and it appears to always work the second time. I did notice a general block device error for that volume in the system logs around that time too.
|
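For anyone wanting to confirm the "logged as unmounted but still mounted" state on a node, here is a small sketch that checks whether a path appears as a mount point in the contents of `/proc/mounts`. It only handles the `\040` escape for spaces, and `is_mounted` is a hypothetical helper, not part of kubelet or the driver.

```python
def is_mounted(mounts_text: str, target: str) -> bool:
    """Return True if `target` is listed as a mount point in /proc/mounts text."""
    wanted = target.rstrip("/")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Field 2 is the mount point; spaces are written as the octal escape \040.
        mount_point = fields[1].replace("\\040", " ")
        if mount_point.rstrip("/") == wanted:
            return True
    return False
```

On the node this could be called as `is_mounted(open("/proc/mounts").read(), path)`, with `path` being the pod volume directory that kubelet reports it cannot clean up.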
@gregsidelinger This issue might also be relevant to you: #1027. I'm able to replicate this on all volumes, not just migrated or ephemeral ones. There's a fix in v.1.2.1 that is supposed to resolve this, but after using that version I can still replicate the pods stuck terminating. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
/remove-lifecycle rotten I have not seen this recently. Is it just luck, or is it not a problem anymore in 1.20? |
@yhrunyk I have not upgraded yet, and I don't see the problem on 1.21 (it was a typo in my previous message, this one is right) |
Seeing the same issue on our cluster. We are on EKS v.1.20 and using AWS EBS CSI driver 1.5.0. |
Same issue with EKS v1.21; the volume's been terminating for hours. Not sure which EBS CSI driver version we're using or how to check. We don't have it installed as an add-on. I ended up deleting the pod that used the volume, and then the PVC delete worked. |
I'm seeing this same thing on EKS 1.20 without much luck resolving. Has anyone on this issue solved this yet? |
@wongma7 Have you revisited this at all? I'm having a hard time determining if it's Kubelet or some EBS CSI component. Our recent error message:
|
In my case, a pod is stuck in Init state after migrating from the in-tree plugin to EBS CSI. I followed the steps in this doc: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/install.md#migrating-from-in-tree-ebs-plugin
I did the same steps, but got these logs:
Sep 05 10:16:42 ip-10-222-101-237.fcinternal.net kubelet[8233]: E0905 10:16:42.916187 8233 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-02378debf01dbaee9 podName: nodeName:}" failed. No retries permitted until 2022-09-05 10:18:44.916161378 +0000 UTC m=+7250.080135420 (durationBeforeRetry 2m2s). Error: "Volume not attached according to node status for volume "pvc-3eac9d49-b875-4258-849b-7ca94f78014c" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-02378debf01dbaee9") pod "elasticsearch-ha-master-sts-0" (UID: "839f19da-c301-47c0-9970-909ddfad92e4") "
Has anyone found a solution? |
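A side note on the `durationBeforeRetry 2m2s` in that log: kubelet retries failed volume operations with exponential backoff, and a value of 2m2s suggests the operation has already failed enough times to hit the cap. The sketch below reproduces such a schedule, assuming a 500ms initial delay that doubles up to a 2m2s (122s) maximum. Those two constants are from memory and should be treated as assumptions; `retry_delays` is a hypothetical helper.

```python
from typing import List


def retry_delays(attempts: int, initial: float = 0.5, cap: float = 122.0) -> List[float]:
    """Delay in seconds before each retry: doubles every failure, capped at `cap`."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays
```

Under those assumptions `retry_delays(10)` is `[0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 122.0, 122.0]`, so seeing 2m2s in the log would mean at least nine consecutive failures for that volume.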
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hello, I'm having an issue deleting pods that have migrated PVs. After installing aws-ebs-csi-driver (0.7.1) and enabling CSIMigrationAWS, new PVs/pods from a CSI storageclass work fine; they can be restarted and terminated normally. I also see that all PVs were migrated to CSI and show as attached with volumeattachment. Resizes on those migrated PVs also work fine. However, when I delete a pod with a migrated PV, it hangs in Terminating status. The only workaround I've found is to terminate the instance and have the ASG spin up a new one. The pod then gets rescheduled and can mount the volume. This is a cluster created on EC2 with kubeadm/v1.18.6. This seems to be the primary error message being looped in the worker node kubelet logs:
node ebs-plugin logs:
No other useful logs for the unmount ^, so it seems like it hangs there.
Volume is not attached to the node, confirmed through AWS console, but still has a mount point.