Kubernetes pods losing connection to CEPH during pfsense failover #4774
Unanswered
brettjenkins asked this question in Q&A
Replies: 1 comment
-
This is a discussion where very detailed knowledge of networking with Ceph components is required. Ceph-CSI 'only' acts as a Ceph client, and our understanding of the networking details is limited. You might have more luck contacting the Ceph community through its mailing lists or IRC/Slack: https://ceph.com/en/community/connect/ A description like the one in the question, mentioning whether you use CephFS and/or RBD, possibly with a small diagram of how the two clusters are connected and which IP addresses/ranges are used, would be good.
-
Hey,
I've been experiencing an intermittent but pretty severe issue that I wanted to raise here.
Environment details
Background:
I first encountered this issue during some router maintenance. I had to reinstall pfSense, so I created a temporary VM, restored my backup to that VM, and kept the network running while I reformatted the bare metal server. However, when I switched back to the bare metal server, my Kubernetes cluster started behaving strangely. Applications became unresponsive; for instance, Emby would begin to load its page but then fail midway. Deleting the affected pods would hang; the old pods would time out during deletion, preventing new pods from starting.
The cluster was stuck for about an hour until I resumed the VM running the temporary pfSense instance (and turned off the bare metal server). Almost immediately, everything started working again, despite having identical pfSense settings on both systems.
I initially chalked this up to a transient issue: I reformatted the bare metal server again, restored the backup, and the subsequent switchover worked fine. However, the problem has since recurred.
Current Issue:
I've recently set up pfSense HA to ensure continuous network availability in case the bare metal server fails. The pfSense HA setup works perfectly, restoring Internet access within a few seconds. However, this time, the Kubernetes/CEPH connection isn't happy.
Results:
Failover Test 1:
Applications went unresponsive for about 10 minutes before suddenly working correctly.
Failover Test 2:
Switched back to primary: Same symptoms as Test 1, 10-20 minutes of unresponsiveness followed by recovery.
Failover Test 3:
Switched back to secondary: Same symptoms.
Failover Test 4:
Switched back to primary: Same symptoms, but the cluster never recovered, even after 90 minutes.
Deleting an unresponsive pod would give this error:
with the new pod unable to start
Simultaneously, a new PVC couldn't be provisioned and was timing out.
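For anyone who wants to dig in, the Kubernetes-side state I can capture while it's stuck is roughly the following; the PVC/pod names and the ceph-csi namespace/workload names below are placeholders for whatever your install uses.

```bash
# Placeholder names - substitute your own namespaces, PVC/pod names, and the
# workload names your ceph-csi install actually uses.

# Events on the stuck PVC and on the pod that can't start
kubectl describe pvc emby-config -n media
kubectl describe pod emby-0 -n media
kubectl get events -n media --sort-by=.lastTimestamp

# Recent logs from the ceph-csi provisioner and nodeplugin pods
kubectl -n ceph-csi logs deploy/csi-rbdplugin-provisioner -c csi-rbdplugin --tail=200
kubectl -n ceph-csi logs ds/csi-rbdplugin -c csi-rbdplugin --tail=200
```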
After 90 minutes, I restarted all the CEPH nodes, and everything instantly started working again.
How can I address this issue? It's alarming when the entire cluster becomes unresponsive and all applications fail. Notably, the pods can still reach CEPH the whole time, so there doesn't seem to be any basic network issue there (ping works, and so does cat < /dev/tcp/CEPHIP/6789 and /3300); if it were an obvious network issue I'd understand more!
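Concretely, the connectivity check I'm referring to is just this, run from a node or a pod (CEPHIP is a placeholder for one of the Ceph monitor addresses), and it succeeds even while everything is hung:

```bash
# CEPHIP is a placeholder for one of the Ceph monitor IPs.
ping -c 3 CEPHIP

# Bash /dev/tcp check against both monitor ports (6789 = msgr v1, 3300 = msgr v2).
# A refused or unroutable port errors out straight away; a successful connection
# just sits there holding the socket open (Ctrl-C, or wrap in `timeout`, to exit).
cat < /dev/tcp/CEPHIP/6789
cat < /dev/tcp/CEPHIP/3300
```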
Rebooting the router (when I was originally transferring it over) didn't resolve the issue; only switching back to the VM router did. Notably, the VM router was saved and restored rather than being a fresh instance, which makes me wonder whether this is some issue with connection states. However, I did try clearing the state table the last time this happened, and that didn't fix anything.
Has anyone else faced a similar issue? I need to solve this because it feels like a ticking time bomb under my cluster. Moreover, I want pfSense HA to work consistently since stability is the main reason for setting it up.
Is there anything specific I should try the next time this occurs? It seems to be pretty easy to reproduce for me.
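Unless someone suggests something better, my rough plan for the next failover test is to capture something like the following while it's stuck (on a Ceph node and on an affected Kubernetes node respectively):

```bash
# On a Ceph node: overall cluster/monitor state during the outage
ceph -s
ceph health detail
ceph mon stat

# On an affected Kubernetes node: any kernel Ceph/RBD client messages
dmesg -T | grep -iE 'libceph|rbd' | tail -n 50

# Which monitor sessions the node is actually holding open
ss -tnp | grep -E ':6789|:3300'
```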
Thank you for your help!