Problem deleting clustered nodes out-of-order #342

Open
cmu-rgrempel opened this issue Jun 5, 2024 · 1 comment

I'm trying out clustered nodes with ctdb for the first time, using version 0.5, and I've been able to get it working nicely with a cephfs backend. It's very pleasing!

I've been experimenting with failover etc. by randomly deleting the pods that the operator creates (to simulate evictions, node failures, and so on). What I'm experiencing is that if I delete the second pod (in my case, cmu-fileshare-1), it comes back up in the expected way. However, if I delete the pods "out of order" -- that is, if I delete the first pod (in my case, cmu-fileshare-0) -- then it doesn't come back up successfully.
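For reference, the "deletion" here is nothing fancier than removing the operator-created pods by hand, roughly like this (the pod names are just the ones from my deployment):

kubectl delete pod cmu-fileshare-0   # simulate an eviction / node failure
kubectl get pods -w                  # watch whether it comes back up cleanly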

What I see from kubectl get pods is this:

NAME              READY   STATUS             RESTARTS      AGE
cmu-fileshare-0   4/5     CrashLoopBackOff   2 (15s ago)   2m23s
cmu-fileshare-1   5/5     Running            0             6m40s

And what I see from kubectl logs cmu-fileshare-0 -c wb is this:

2024-06-05 02:41:05,535: INFO: Enabling ctdb in samba config file
winbindd version 4.19.6 started.
Copyright Andrew Tridgell and the Samba Team 1992-2023
Could not fetch our SID - did we join?
unable to initialize domain list

I'm wondering whether this might be related to #262, another issue that may have something to do with the exact order in which nodes are brought up and with whether certain initialization steps are performed or skipped.

I'll dive into this further if I have time -- just thought I'd jot down the experience in case it is helpful to anyone.

cmu-rgrempel commented Jun 6, 2024

I did some more investigating today.

First, I'm no longer sure that the problem really has to do with the order in which pods are deleted. At one point today I saw the problem even when deleting pods in the expected order, so I now suspect the problem occurs somewhat randomly when a pod is deleted.

There were several occasions today when a deleted pod came back up but winbind was clearly not working (for instance, running id inside the pod produced odd, incomplete results).
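(Concretely, that check was something along these lines -- the container is the same wb container as above, and the user is just a placeholder from my test domain:)

kubectl exec cmu-fileshare-0 -c wb -- id 'MYDOMAIN\someuser'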

I ran tdbdump /var/lib/ctdb/persistent/secrets.tdb.X and noticed that in cases where pods came back up fine, the expected handful of key/value pairs was present. In cases where the pods came up problematically, there were only two key/value pairs: one was __db_sequence_number__\00, and the other was SECRETS/SID/CMU-FILESHARE-0.
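(In case anyone wants to reproduce the check: I was listing the keys in each node's copy of the database with something like the following, run wherever the ctdb persistent directory is mounted -- the grep just picks out the key lines from tdbdump's output:)

for f in /var/lib/ctdb/persistent/secrets.tdb.*; do
  echo "== $f"
  tdbdump "$f" | grep '^key'
done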

Reflecting on that, I noticed that while the secrets.tdb is said to be "persistent" (and lives in a directory called "persistent"), it was actually located in an emptyDir volume in the pod configuration, so the directory was being deleted along with the pod. I wondered whether it would make a difference to actually make that directory persistent, and initial results suggest that this fixes the problem for me. At least, I've gone through several rounds of deleting pods now, and so far they come back up consistently.
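Roughly speaking, the change amounts to backing /var/lib/ctdb/persistent with a real volume instead of an emptyDir. A sketch of the relevant part of the pod spec (the volume, claim, and container names here are placeholders I made up, not what the operator actually generates):

spec:
  volumes:
    - name: ctdb-persistent
      # previously something like:  emptyDir: {}
      persistentVolumeClaim:
        claimName: ctdb-persistent-pvc   # placeholder PVC, e.g. cephfs-backed
  containers:
    - name: ctdb   # placeholder; mount it in whichever containers use the directory
      volumeMounts:
        - name: ctdb-persistent
          mountPath: /var/lib/ctdb/persistent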

In any event, my current theory is that the problem is related to the secrets.tdb somehow being incompletely restored by ctdb when a pod comes back up. The fact that it sometimes occurred and sometimes didn't suggests some kind of race condition. But I don't have enough actual knowledge of samba and ctdb to know whether those are sensible thoughts.
