Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
K8SPS-280: Improve full cluster crash recovery
Before these changes, we rebooted the cluster after a complete outage from pod-0 without checking which member had the latest transactions. As a result, our full cluster recovery implementation was prone to data loss.

Now we use mysql-shell's built-in checks to detect the member to reboot from. For this, mysql-shell requires every member to be reachable so that it can connect and check the GTIDs in each one. That means that after a full cluster crash we need to start each pod and ensure they're all reachable.

We're bringing back the `/var/lib/mysql/full-cluster-crash` file to address this requirement. Pods create this file if they detect they're in a full cluster crash, and then restart themselves. After the restart, they start the mysqld process but ensure the server starts as read-only. Once all pods are up and running (ready), the operator runs `dba.rebootClusterFromCompleteOutage()` in one of the MySQL pods. Which pod we run it in doesn't matter, since mysql-shell connects to each pod and selects a suitable one to reboot from.

*Events*

This commit also introduces the event recorder and two events:

1. FullClusterCrashDetected
2. FullClusterCrashRecovered

Users will be able to see these events on the `PerconaServerMySQL` object. For example:

```
$ kubectl describe ps cluster1
...
Events:
  Type     Reason                     Age                 From           Message
  ----     ------                     ----                ----           -------
  Warning  FullClusterCrashDetected   19m (x10 over 20m)  ps-controller  Full cluster crash detected
  Normal   FullClusterCrashRecovered  17m                 ps-controller  Cluster recovered from full cluster crash
```

*Probe timeouts*

Kubernetes had some problems with timeouts in exec probes, which were fixed in recent releases. But we still see problematic behavior. For example, even though Kubernetes successfully detects the timeout in a probe, it doesn't count the timeout as a failure, so the container is not restarted even if its liveness probe times out a million times. With this commit we handle timeouts ourselves with contexts.
Showing 17 changed files with 550 additions and 272 deletions.