Problem deleting clustered nodes out-of-order #342

Open
cmu-rgrempel opened this issue Jun 5, 2024 · 1 comment

I'm trying out clustered nodes with ctdb for the first time, using version 0.5, and I've been able to get it working nicely with a cephfs backend. It's very pleasing!

I've been experimenting with failover etc. by randomly deleting the pods that the operator creates (to simulate evictions, node failures, and so on). What I'm experiencing is that if I delete the second pod (in my case, cmu-fileshare-1), it comes back up in the expected way. However, if I delete the pods "out of order" -- that is, if I delete the first pod (in my case, cmu-fileshare-0) -- then it doesn't come back up successfully.
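For reference, the "deletion" here is nothing fancier than removing the operator-created pods by hand, roughly like this (the pod names are just the ones from my deployment):

kubectl delete pod cmu-fileshare-0   # simulate an eviction / node failure
kubectl get pods -w                  # watch whether it comes back up cleanly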

What I see from kubectl get pods is this:

NAME              READY   STATUS             RESTARTS      AGE
cmu-fileshare-0   4/5     CrashLoopBackOff   2 (15s ago)   2m23s
cmu-fileshare-1   5/5     Running            0             6m40s

And what I see from kubectl logs cmu-fileshare-0 -c wb is this:

2024-06-05 02:41:05,535: INFO: Enabling ctdb in samba config file
winbindd version 4.19.6 started.
Copyright Andrew Tridgell and the Samba Team 1992-2023
Could not fetch our SID - did we join?
unable to initialize domain list

I'm wondering whether this might be related to #262, another issue that may have something to do with the exact order in which nodes are brought up and with whether certain initialization steps are performed or skipped.

I'll dive into this further if I have time -- just thought I'd jot down the experience in case it is helpful to anyone.

cmu-rgrempel commented Jun 6, 2024

I did some more investigating today.

First, I'm no longer sure that the problem really has to do with the order in which pods are deleted. At one point today I saw the problem even when deleting pods in the expected order, so I now suspect the problem occurs somewhat randomly when a pod is deleted.

There were several occasions today when a deleted pod came back up but winbind was clearly not working (for instance, running id inside the pod produced odd, incomplete results).
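(Concretely, that check was something along these lines -- the container is the same wb container as above, and the user is just a placeholder from my test domain:)

kubectl exec cmu-fileshare-0 -c wb -- id 'MYDOMAIN\someuser'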

I ran tdbdump /var/lib/ctdb/persistent/secrets.tdb.X and noticed that in cases where pods came back up fine, the expected handful of key/value pairs was present. In cases where the pods came up problematically, there were only two key/value pairs: one was __db_sequence_number__\00, and the other was SECRETS/SID/CMU-FILESHARE-0.
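(In case anyone wants to reproduce the check: I was listing the keys in each node's copy of the database with something like the following, run wherever the ctdb persistent directory is mounted -- the grep just picks out the key lines from tdbdump's output:)

for f in /var/lib/ctdb/persistent/secrets.tdb.*; do
  echo "== $f"
  tdbdump "$f" | grep '^key'
done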

Reflecting on that, I noticed that while the secrets.tdb is said to be "persistent" (and lives in a directory called "persistent"), it was actually located in an emptyDir volume in the pod configuration, so the directory was being deleted along with the pod. I wondered whether it would make a difference to actually make that directory persistent, and initial results suggest that this fixes the problem for me. At least, I've gone through several rounds of deleting pods now, and so far they come back up consistently.
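Roughly speaking, the change amounts to backing /var/lib/ctdb/persistent with a real volume instead of an emptyDir. A sketch of the relevant part of the pod spec (the volume, claim, and container names here are placeholders I made up, not what the operator actually generates):

spec:
  volumes:
    - name: ctdb-persistent
      # previously something like:  emptyDir: {}
      persistentVolumeClaim:
        claimName: ctdb-persistent-pvc   # placeholder PVC, e.g. cephfs-backed
  containers:
    - name: ctdb   # placeholder; mount it in whichever containers use the directory
      volumeMounts:
        - name: ctdb-persistent
          mountPath: /var/lib/ctdb/persistent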

In any event, my current theory is that the problem is related to the secrets.tdb somehow being incompletely restored by ctdb when a pod comes back up. The fact that it sometimes occurred and sometimes didn't suggests some kind of race condition. But I don't have enough actual knowledge of samba and ctdb to know whether those are sensible thoughts.
