Big problems when half of the workers are down (2/4) #1258

Open
gpow81 opened this issue Dec 7, 2024 · 2 comments

gpow81 commented Dec 7, 2024

Hello,
We have a Storage Scale cluster configured via CNSA, connected to a remote Scale cluster.
I have trouble making the system function after losing 2 out of 4 worker nodes. I was expecting that whatever Scale needs to keep working would be automatically rescheduled onto the remaining nodes.
One observation I've made is that, in a setup with 4 worker nodes, the CNSA deployment creates:
1 provisioner pod,
2 attacher pods,
1 operator pod,
2 GUI pods,
2 pmcollector pods,
1 resizer pod, and
1 snapshotter pod.

When half of the workers go offline, some of the CNSA pods eventually reschedule onto the remaining nodes, but the process takes a significant amount of time, and many of the pods go into CrashLoopBackOff, including the attachers.
The situation with the GUI pods is worse: they get stuck in a Terminating state and never reschedule.
Overall we end up in a very unhealthy state.
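
For context, this is how I was watching it happen (the two namespaces are the defaults on my install and may differ on yours):

    # watch where the CSI driver pods and the CNSA core/GUI/pmcollector pods land
    oc get pods -n ibm-spectrum-scale-csi -o wide -w
    oc get pods -n ibm-spectrum-scale -o wide -w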

Once the workers are brought back online, everything recovers, and all pods return to normal operation. But that’s far from ideal.

My question is: can we configure the critical CNSA ReplicaSets to maintain 4 replicas (ensuring there is always at least one running even if only one worker remains)? If so, how can we achieve this?
Alternatively, how can we force critical pods (for example the attacher) to run on specific worker nodes? This would help mitigate the issue, because I could make sure that the 2 worker nodes we lose are not the ones running both copies of the attacher/GUI/collector pods. But that doesn't help with the provisioner; we need it to scale beyond a single replica.
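
To make the first question concrete, this is the kind of thing I mean (a rough sketch only; the namespace and the attacher Deployment name are the defaults on my install, and I assume the operator would simply reconcile a manual change back):

    # list the CSI sidecar Deployments and their current replica counts
    oc get deployments -n ibm-spectrum-scale-csi

    # try to raise the attacher to 4 replicas; the operator presumably reverts
    # this, which is why I am asking for a supported way to do it
    oc scale deployment ibm-spectrum-scale-csi-attacher --replicas=4 -n ibm-spectrum-scale-csi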

Thank you!

gpow81 commented Dec 7, 2024

I was able to force some of the pods to spread out properly by modifying the CSIScaleOperator CR and adding my own label to provisionerNodeSelector, resizerNodeSelector, snapshotterNodeSelector, and attacherNodeSelector.
Is that the right way to do it?
I am still not sure how to properly pin the GUI pods to specific nodes. I was able to spread them around by simply rebooting the cluster and getting lucky that they ended up in the "correct" places.
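
Roughly what I did, in case it helps anyone (a sketch only; scale-csi=allowed is just a label I made up, and the exact field layout should be checked against the CSIScaleOperator sample shipped with the operator):

    # label the workers that should be allowed to run the CSI sidecars
    oc label node worker-1 worker-2 scale-csi=allowed

    # excerpt of the CSIScaleOperator spec after the change
    spec:
      provisionerNodeSelector:
        - key: "scale-csi"
          value: "allowed"
      attacherNodeSelector:
        - key: "scale-csi"
          value: "allowed"
      resizerNodeSelector:
        - key: "scale-csi"
          value: "allowed"
      snapshotterNodeSelector:
        - key: "scale-csi"
          value: "allowed"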

But the main problem is still with the GUI pods even if I make sure that one "survives".

As soon as one of the two GUI pods goes down, the other one goes into 3/4 Running status:
Warning Unhealthy 4m51s (x32 over 15m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 501
oc logs doesn't really tell what is going on, but I think the container failing readiness is "liberty", although it seems to be running.
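
What I used to check which container is not ready (standard oc commands; the pod and container names below are from my cluster and may differ):

    # container readiness states plus recent events for the surviving GUI pod
    oc describe pod ibm-spectrum-scale-gui-0 -n ibm-spectrum-scale

    # logs of the liberty container specifically
    oc logs ibm-spectrum-scale-gui-0 -c liberty -n ibm-spectrum-scale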

In this state PVs cannot be attached anymore:

Warning FailedAttachVolume 90s (x3 over 5m35s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-c6be73ca-bf8b-4399-8848-4edc1451a1f0" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of AC10460D:6726F422. Error [rpc error: code = Internal desc = Response unmarshal failed: GET request https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.ocp.x.x:443/scalemgmt/v2/filesystems?filter=uuid=AC10460D:6726F422, user: CsiAdmin, param: { }, response: &{503 Service Unavailable 503 HTTP/1.0 1 0 map[Cache-Control:[private, max-age=0, no-cache, no-store] Content-Type:[text/html] Pragma:[no-cache]] 0xc0003b65c0 -1 [] true false map[] 0xc000254a20 0xc000340000}, error json.Unmarshal failed invalid character '<' looking for beginning of value]
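
To rule out the driver side, I also queried the same GUI REST endpoint from the event above directly (a sketch; the CsiAdmin password comes from the CSI secret on my cluster):

    # the same filesystems query the driver makes; a 503 here means the GUI
    # itself is refusing to serve, not the attacher
    curl -k -u 'CsiAdmin:<password>' \
      "https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.ocp.x.x:443/scalemgmt/v2/filesystems?filter=uuid=AC10460D:6726F422"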

I am thinking that quorum is lost inside the CNSA cluster, but I don't understand why this is a problem. I thought the GUI doesn't need quorum and should serve the REST API no matter what. It's there only to contact the remote cluster, which of course is healthy.

I am starting to think that I should not be using CNSA and should just focus on the Scale CSI driver. I am guessing I would not have these issues, or would I?

gpow81 commented Dec 7, 2024

Additional note: at this point I am pretty sure the GUI pod that survived actually breaks when the CNSA cluster loses quorum. That's a problem, and I am not sure how to solve it if I have an even number of nodes (and an even number of physical servers, 4:2).
On the real Storage Scale cluster this was solved with a tiebreaker disk. How can I solve it here? This is just a "client" CNSA cluster, so I would hope there is a way to make it work with 4 worker nodes.
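
For completeness, this is how I have been checking quorum inside CNSA (plain mm commands run in one of the core pods; substitute your own core pod name):

    # cluster layout and which nodes are designated quorum
    oc exec -n ibm-spectrum-scale <core-pod-name> -- mmlscluster

    # node states; shows whether quorum is currently held
    oc exec -n ibm-spectrum-scale <core-pod-name> -- mmgetstate -a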
