[BUG] Parallel Cluster Recovery didn't work #730
Comments
Hi, I have seen the same behaviour after a change from 2.10.0 to 2.11.1.
Just had the same issue. All my k8s nodes crashed, but after k8s recovered, the OpenSearch cluster did not recover. Only one pod from the statefulset started. I also had to stop the operator and manually change [...] Unfortunately I did not find anything in the logs; there was no error. The operator just did nothing. It did not switch to parallel recovery.
[Triage] Thanks
Hello, @prudhvigodithi
@prudhvigodithi I also encountered this problem with the last version (2.5.1)
@swoehrl-mw can you please take a look at this? Thanks
@pstast @pawellrus @Links2004
There are some points in the logic where the operator can skip the recovery due to errors, but these should all produce some log.
@swoehrl-mw answer to all questions - yes.
Hi all, I did some looking through the code and testing and I think I can reproduce the problem and found the cause:
Thanks @swoehrl-mw we can target this after v2.6.0 release.
### Description
Fixes a bug where parallel recovery did not engage if the cluster had gone through a version upgrade beforehand. Reason was that the upgrade logic added the status for each nodepool twice leading to the recovery logic incorrectly detecting if an upgrade was in progress.
Also did a small refactoring of names and constants and fixed a warning by my IDE about unused parameters.
No changes to the CRDs or functionality, just internal logic fixes.

### Issues Resolved
Fixes #730

### Check List
- [x] Commits are signed per the DCO using --signoff
- [x] Unittest added for the new/changed functionality and all unit tests are successful
- [x] Customer-visible features documented
- [x] No linter warnings (`make lint`)

If CRDs are changed:
- [-] CRD YAMLs updated (`make manifests`) and also copied into the helm chart
- [-] Changes to CRDs documented

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check [here](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#developer-certificate-of-origin).

Signed-off-by: Sebastian Woehrl <[email protected]>
The same description was posted a second time with the note: (cherry picked from commit 4f59766)
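To make the root cause from the pull request description easier to picture, here is a minimal Go sketch of one plausible way a doubled status entry can mislead an "upgrade in progress" check. It is not the operator's code: `NodePoolStatus`, `upgradeInProgress`, `markUpgraded`, `upsert`, the state strings and the pool name `masters` are all hypothetical names invented for illustration, and the exact mechanism in the operator may differ.

```go
package main

import "fmt"

// NodePoolStatus is a hypothetical status record; the operator's real
// types and field names may differ.
type NodePoolStatus struct {
	NodePool string
	State    string // "Upgrading" or "Upgraded" (assumed values)
}

// upgradeInProgress reports whether any record is still "Upgrading".
func upgradeInProgress(statuses []NodePoolStatus) bool {
	for _, s := range statuses {
		if s.State == "Upgrading" {
			return true
		}
	}
	return false
}

// markUpgraded updates only the first record found for the pool, as a
// lookup that assumes one record per pool would do.
func markUpgraded(statuses []NodePoolStatus, pool string) {
	for i := range statuses {
		if statuses[i].NodePool == pool {
			statuses[i].State = "Upgraded"
			return
		}
	}
}

// upsert replaces the existing record for a pool or appends a new one,
// so each pool ends up with at most one record.
func upsert(statuses []NodePoolStatus, s NodePoolStatus) []NodePoolStatus {
	for i := range statuses {
		if statuses[i].NodePool == s.NodePool {
			statuses[i] = s
			return statuses
		}
	}
	return append(statuses, s)
}

func main() {
	// Buggy path: the same pool was recorded twice during the upgrade.
	dup := []NodePoolStatus{{"masters", "Upgrading"}, {"masters", "Upgrading"}}
	markUpgraded(dup, "masters")
	fmt.Println("duplicate records, upgrade in progress:", upgradeInProgress(dup))

	// Deduplicated path: recording the pool twice still leaves one record.
	var single []NodePoolStatus
	single = upsert(single, NodePoolStatus{"masters", "Upgrading"})
	single = upsert(single, NodePoolStatus{"masters", "Upgrading"})
	markUpgraded(single, "masters")
	fmt.Println("single record, upgrade in progress:", upgradeInProgress(single))
}
```

In this model the stale duplicate keeps the check reporting an upgrade long after it finished, which matches the reported symptom: a recovery step that stays out of the way during upgrades would never engage.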
Version: 2.4.0, 2.5.1
Function: Parallel Cluster Recovery
Description:
After an upgrade of the managed k8s cluster, the hosted OpenSearch cluster got stuck in the "waiting for quorum" state.
Scenario:
Performed a k8s cluster upgrade. During the upgrade all nodes were recreated and all pods were rescheduled TWICE at the same time.
Expected behaviour:
All nodes should be started at the same time. (statefulset has podManagementPolicy: Parallel)
Cluster recovered.
Actual behaviour:
Only one node of the cluster is started. The other nodes do not start, so quorum cannot be achieved.
The operator does nothing about this situation. The statefulset has podManagementPolicy: OrderedReady
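The deadlock described above can be sketched with a small Go model. It assumes a pod only reports Ready once a majority of the cluster's nodes is running, which is a simplification and not taken from the operator. `quorum` and `startedPods` are hypothetical helpers, and the replica count of 3 is just an example.

```go
package main

import "fmt"

// quorum returns the majority needed for a cluster of the given size.
func quorum(nodes int) int {
	return nodes/2 + 1
}

// startedPods models how many pods a StatefulSet controller creates after
// all pods were lost at once. Assumption of this model: a pod becomes
// Ready only when at least quorum(replicas) pods are running.
func startedPods(policy string, replicas int) int {
	if policy == "Parallel" {
		// Parallel: all pods are created without waiting for each other.
		return replicas
	}
	// OrderedReady: pod i+1 is created only after pod i is Ready.
	started := 0
	for started < replicas {
		started++
		if started < quorum(replicas) {
			break // this pod never becomes Ready, so the controller waits
		}
	}
	return started
}

func main() {
	const replicas = 3 // example size, not from the issue
	for _, policy := range []string{"OrderedReady", "Parallel"} {
		fmt.Printf("%s: %d of %d pods started, quorum needs %d\n",
			policy, startedPods(policy, replicas), replicas, quorum(replicas))
	}
}
```

Under these assumptions OrderedReady stalls at the first pod for any cluster that needs more than one node for quorum, while Parallel starts every pod and lets the cluster form.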
Thoughts:
I fixed this problem by scaling the operator deployment down to 0 and then manually recreating the STS with podManagementPolicy: Parallel.
I am not sure that I can reproduce this case again. Maybe the operator should have some manual trigger to perform parallel cluster restoration in cases like this?