K8SPS-359: Fix rebooting cluster if it's partially online #725

egegunes · 2024-08-13T07:16:03Z

CHANGE DESCRIPTION

Problem:
If 2 out of 3 MySQL pods are terminated abruptly, the operator sporadically fails to reboot the cluster.

Cause:
Even though the failed pods go into crash recovery mode once they come back up, the reboot fails: when the operator runs dba.rebootClusterFromCompleteOutage(), MySQL reports that the cluster is already ONLINE.

Solution:
If the operator cannot run dba.rebootClusterFromCompleteOutage() because the cluster is already ONLINE, it deletes all MySQL pods to force recovery.
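A minimal shell sketch of the decision described in the Solution, for illustration only. The operator implements this in Go. The `mysqlsh` invocation, the assumption that a failed reboot whose output mentions "ONLINE" means the cluster is partially online, and the function names `recovery_action` and `reboot_or_force_recovery` are all hypothetical, not taken from the operator source.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real operator logic is written in Go.
# ASSUMPTION: a failed reboot whose output contains "ONLINE" means the
# cluster is already (partially) online. The exact error text is not
# taken from the operator or MySQL Shell source.

# Map the exit status and output of the reboot call to an action.
recovery_action() {
  local status=$1 output=$2
  if [ "$status" -eq 0 ]; then
    echo "rebooted"
  elif printf '%s\n' "$output" | grep -q 'ONLINE'; then
    echo "delete-all-pods"
  else
    echo "retry"
  fi
}

# Try dba.rebootClusterFromCompleteOutage(); if the cluster is already
# ONLINE, delete all MySQL pods to force recovery.
# Arguments (hypothetical): MySQL URI, namespace, pod label selector.
reboot_or_force_recovery() {
  local uri=$1 namespace=$2 selector=$3 output status
  output=$(mysqlsh --js --uri "$uri" -e 'dba.rebootClusterFromCompleteOutage()' 2>&1)
  status=$?
  case "$(recovery_action "$status" "$output")" in
    rebooted)
      echo "cluster rebooted"
      ;;
    delete-all-pods)
      echo "cluster is already ONLINE; deleting all MySQL pods to force recovery"
      kubectl delete pod -n "$namespace" -l "$selector"
      ;;
    *)
      echo "reboot failed, will retry: $output" >&2
      return 1
      ;;
  esac
}
```

Keeping the classification in a pure function (`recovery_action`) makes the branch that decides to delete pods easy to test without a live cluster. An example call, with hypothetical values: `reboot_or_force_recovery 'root@cluster1-mysql-0:3306' ps 'app.kubernetes.io/component=mysql'`.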

CHECKLIST

Jira

Is the Jira ticket created and referenced properly?
Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

Is an E2E test/test case added for the new feature/change?
Are unit tests added where appropriate?

Config/Logging/Testability

Are all needed new/changed options added to default YAML files?
Did we add proper logging messages for operator actions?
Did we ensure compatibility with the previous version or cluster upgrade process?
Does the change support the oldest and newest supported PS versions?
Does the change support the oldest and newest supported Kubernetes versions?

github-actions · 2024-08-13T07:16:25Z

e2e-tests/functions

+    "SELECT MEMBER_HOST FROM performance_schema.replication_group_members where MEMBER_ROLE='PRIMARY';" \
+    "-h $(get_mysql_router_service $(get_cluster_name)) -P 6446 -uroot -proot_password" \
+      | cut -d'.' -f1


[shfmt] reported by reviewdog 🐶

Suggested change

"SELECT MEMBER_HOST FROM performance_schema.replication_group_members where MEMBER_ROLE='PRIMARY';" \
"-h $(get_mysql_router_service $(get_cluster_name)) -P 6446 -uroot -proot_password" \
| cut -d'.' -f1

"SELECT MEMBER_HOST FROM performance_schema.replication_group_members where MEMBER_ROLE='PRIMARY';" \
"-h $(get_mysql_router_service $(get_cluster_name)) -P 6446 -uroot -proot_password" \
| cut -d'.' -f1
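The excerpt does not show which e2e helper receives these arguments. A minimal sketch of the same check as a standalone function, assuming the stock `mysql` client (`-s -N` prints bare values without headers) instead of the suite's own helper; `get_primary_via_router` and `short_host` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical standalone version of the primary-lookup check in the diff.

# Keep only the first DNS label of each input line (pod name from an FQDN).
short_host() {
  cut -d'.' -f1
}

# Ask MySQL Router's read/write port (6446) which group member is PRIMARY.
# Arguments (assumed): router service host, root password.
get_primary_via_router() {
  local router_host=$1 password=$2
  mysql -h "$router_host" -P 6446 -uroot -p"$password" -sN -e \
    "SELECT MEMBER_HOST FROM performance_schema.replication_group_members WHERE MEMBER_ROLE='PRIMARY';" \
    | short_host
}
```

Querying through the Router's read/write port confirms that the Router itself routes to a working primary, not just that some member reports itself as PRIMARY.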

inelpandzic · 2024-08-14T15:24:51Z

deploy/rbac.yaml

@@ -80,6 +78,20 @@ rules:
  - patch
  - update
  - watch
+- apiGroups:


Do we need these in cw-rbac.yaml as well, for cluster-wide mode?
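Because the fix deletes MySQL pods, the operator's service account needs `delete` on pods under both the namespaced and the cluster-wide RBAC. One way to check an installed operator is to impersonate its service account with `kubectl auth can-i`. This is a sketch under that assumption: the function names and the service account name in the example are hypothetical, and it makes no claim about which verbs the new rule in this diff grants.

```shell
#!/usr/bin/env bash
# Build the RBAC subject string Kubernetes uses for a service account.
sa_subject() {
  local namespace=$1 sa=$2
  printf 'system:serviceaccount:%s:%s\n' "$namespace" "$sa"
}

# Check whether the operator's service account may delete pods.
# scope=cluster-wide checks across all namespaces (the cw-rbac.yaml case);
# any other value checks only the operator's own namespace (rbac.yaml).
# Exit status follows kubectl: 0 if allowed, non-zero otherwise.
can_delete_pods() {
  local namespace=$1 sa=$2 scope=${3:-namespaced}
  local subject
  subject=$(sa_subject "$namespace" "$sa")
  if [ "$scope" = "cluster-wide" ]; then
    kubectl auth can-i delete pods --all-namespaces --as "$subject"
  else
    kubectl auth can-i delete pods -n "$namespace" --as "$subject"
  fi
}
```

Example with hypothetical names: `can_delete_pods ps percona-server-mysql-operator cluster-wide`.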

JNKPercona · 2024-08-19T15:34:44Z

Test name	Status
version-service	passed
async-ignore-annotations	passed
auto-config	passed
config	passed
config-router	passed
demand-backup	passed
gr-demand-backup	passed
gr-demand-backup-haproxy	passed
gr-finalizer	passed
gr-haproxy	passed
gr-ignore-annotations	passed
gr-init-deploy	passed
gr-one-pod	passed
gr-recreate	passed
gr-scaling	passed
gr-scheduled-backup	passed
gr-security-context	passed
gr-self-healing	passed
gr-tls-cert-manager	passed
gr-users	passed
haproxy	passed
init-deploy	passed
limits	passed
monitoring	passed
one-pod	passed
operator-self-healing	passed
recreate	passed
scaling	passed
scheduled-backup	passed
service-per-pod	passed
sidecars	passed
smart-update	passed
tls-cert-manager	passed
users	passed
We ran 34 out of 34

commit: 75bb02d
image: perconalab/percona-server-mysql-operator:PR-725-75bb02dd

K8SPS-359: Fix rebooting cluster if it's partially online

6bb9ea5

pull-request-size bot added the size/L 100-499 lines label Aug 13, 2024

github-actions bot reviewed Aug 13, 2024

egegunes marked this pull request as ready for review August 14, 2024 06:59

egegunes requested review from tplavcic, nmarukovich, ptankov, jvpasinatto, eleo007, hors, inelpandzic and pooknull as code owners August 14, 2024 06:59

fix delete all pods

d4a8e16

inelpandzic reviewed Aug 14, 2024

egegunes added 2 commits August 19, 2024 10:55

Merge branch 'main' into K8SPS-359

21ade91

fix cw-rbac

75bb02d

egegunes requested a review from inelpandzic August 19, 2024 08:31

hors approved these changes Aug 19, 2024

inelpandzic approved these changes Aug 19, 2024

nmarukovich approved these changes Aug 19, 2024

hors merged commit d17dc59 into main Aug 19, 2024
16 checks passed

hors deleted the K8SPS-359 branch August 19, 2024 17:40

egegunes commented Aug 13, 2024 • edited by jira bot