Kubernetes pods losing connection to CEPH during pfsense failover #4774
Unanswered
brettjenkins asked this question in Q&A
Replies: 1 comment
-
This is a discussion where very detailed knowledge of networking with Ceph components is required. Ceph-CSI 'only' acts as a Ceph client, and our understanding of the networking details is limited. You might have more luck contacting the Ceph community through its mailing lists or IRC/Slack: https://ceph.com/en/community/connect/ A description like the one in the question, mentioning whether you use CephFS and/or RBD, possibly with a small diagram of how the two clusters are connected and which IP addresses/ranges are used, would be good.
-
Hey,
I've been experiencing an intermittent but pretty severe issue that I wanted to raise here.
Environment details
Background:
I first encountered this issue during some router maintenance. I had to reinstall pfSense, so I created a temporary VM, restored my backup to that VM, and kept the network running while I reformatted the bare metal server. However, when I switched back to the bare metal server, my Kubernetes cluster started behaving strangely. Applications became unresponsive; for instance, Emby would begin to load its page but then fail midway. Deleting the affected pods would hang; the old pods would time out during deletion, preventing new pods from starting.
The cluster was stuck for about an hour until I resumed the VM running the temporary pfSense instance (and turned off the bare metal server). Almost immediately, everything started working again, despite having identical pfSense settings on both systems.
I initially chalked this up to a transient issue: I reformatted the bare metal server again, restored the backup, and the subsequent switchover worked fine. However, the problem has since recurred.
Current Issue:
I've recently set up pfSense HA to ensure continuous network availability in case the bare metal server fails. The pfSense HA setup works perfectly, restoring Internet access within a few seconds. However, this time, the Kubernetes/CEPH connection isn't happy.
Results:
Failover Test 1:
Applications went unresponsive for about 10 minutes before suddenly working correctly.
Failover Test 2:
Switched back to primary: Same symptoms as Test 1, 10-20 minutes of unresponsiveness followed by recovery.
Failover Test 3:
Switched back to secondary: Same symptoms.
Failover Test 4:
Switched back to primary: Same symptoms, but the cluster never recovered, even after 90 minutes.
Deleting an unresponsive pod would give this error:
with the new pod unable to start
Simultaneously, a new PVC couldn't be provisioned and was timing out.
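For anyone who wants to dig in, the Kubernetes-side state I can capture while it's stuck is roughly the following; the PVC/pod names and the ceph-csi namespace/workload names below are placeholders for whatever your install uses.

```bash
# Placeholder names - substitute your own namespaces, PVC/pod names, and the
# workload names your ceph-csi install actually uses.

# Events on the stuck PVC and on the pod that can't start
kubectl describe pvc emby-config -n media
kubectl describe pod emby-0 -n media
kubectl get events -n media --sort-by=.lastTimestamp

# Recent logs from the ceph-csi provisioner and nodeplugin pods
kubectl -n ceph-csi logs deploy/csi-rbdplugin-provisioner -c csi-rbdplugin --tail=200
kubectl -n ceph-csi logs ds/csi-rbdplugin -c csi-rbdplugin --tail=200
```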
After 90 minutes, I restarted all the CEPH nodes, and everything instantly started working again.
How can I address this issue? It's alarming when the entire cluster becomes unresponsive and all applications fail. Notably, the pods can still reach CEPH the whole time, so there doesn't seem to be any basic network issue there (ping works, and so does cat < /dev/tcp/CEPHIP/6789 and /3300); if it were an obvious network issue I'd understand more!
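Concretely, the connectivity check I'm referring to is just this, run from a node or a pod (CEPHIP is a placeholder for one of the Ceph monitor addresses), and it succeeds even while everything is hung:

```bash
# CEPHIP is a placeholder for one of the Ceph monitor IPs.
ping -c 3 CEPHIP

# Bash /dev/tcp check against both monitor ports (6789 = msgr v1, 3300 = msgr v2).
# A refused or unroutable port errors out straight away; a successful connection
# just sits there holding the socket open (Ctrl-C, or wrap in `timeout`, to exit).
cat < /dev/tcp/CEPHIP/6789
cat < /dev/tcp/CEPHIP/3300
```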
Rebooting the router (when I was originally transferring it over) didn't resolve the issue; only switching back to the VM router did. Notably, the VM router was saved and restored rather than being a fresh instance, which makes me wonder whether this is some issue with connection states. However, I did try clearing the state table the last time this happened, and that didn't fix anything.
Has anyone else faced a similar issue? I need to solve this because it feels like a ticking time bomb under my cluster. Moreover, I want pfSense HA to work consistently since stability is the main reason for setting it up.
Is there anything specific I should try the next time this occurs? It seems to be pretty easy to reproduce for me.
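Unless someone suggests something better, my rough plan for the next failover test is to capture something like the following while it's stuck (on a Ceph node and on an affected Kubernetes node respectively):

```bash
# On a Ceph node: overall cluster/monitor state during the outage
ceph -s
ceph health detail
ceph mon stat

# On an affected Kubernetes node: any kernel Ceph/RBD client messages
dmesg -T | grep -iE 'libceph|rbd' | tail -n 50

# Which monitor sessions the node is actually holding open
ss -tnp | grep -E ':6789|:3300'
```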
Thank you for your help!