Improve tracking and error reporting of startup probe #298
+205
−16
Currently, startup probes are scheduled with the k8s defaults (run every 10s, failure threshold of 3). As galera joiner nodes can take a long time to start, this generates unnecessary unhealthy events.
Rework how the startup probe works by allowing a single, long probe which internally loops while probing the startup state. Throughout the startup process, keep track of the specific startup phase so that if the startup times out, the probe can log a precise error.
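
To illustrate the idea of a single long-running probe, here is a minimal sketch in Python. It is not the code from this PR: the function name `run_startup_probe`, the `get_phase` callback, and the phase strings are all hypothetical, and it assumes the probe can obtain the current startup phase as a string on each iteration.

```python
import time


def run_startup_probe(get_phase, timeout_s, poll_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll the startup phase in a loop until 'ready' or until timeout.

    get_phase is a hypothetical callback returning the current phase
    name, or None when the phase cannot be determined right now.
    Returns (ok, message); on timeout the message names the last phase seen.
    """
    deadline = clock() + timeout_s
    last_phase = "unknown"
    while True:
        phase = get_phase()
        if phase is not None:
            # Remember the most recent known phase for error reporting.
            last_phase = phase
        if last_phase == "ready":
            return True, "startup complete"
        if clock() >= deadline:
            return False, (
                f"startup timed out after {timeout_s}s "
                f"in phase '{last_phase}'"
            )
        sleep(poll_s)
```

Because the loop lives inside the probe, the kubelet sees one probe execution rather than a series of short failed ones, and the timeout message can name the phase that was in progress. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.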
Also rework how joiner nodes are tracked, so that startup fails early if galera cannot join a primary partition. This avoids the server being stuck until the startup probe times out.
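
The fail-early behavior for joiners could be sketched roughly as follows. This is an assumption about one possible approach, not the logic in this PR: the function `classify_joiner` and the `join_grace_s` grace period are hypothetical, and the sketch assumes the probe has already read the Galera status variables `wsrep_cluster_status` and `wsrep_local_state_comment` into a dict.

```python
def classify_joiner(status, elapsed_s, join_grace_s=30.0):
    """Classify a joiner node as 'ready', 'joining', or 'failed'.

    status is a dict of Galera status variables already fetched by the
    caller. join_grace_s is a hypothetical grace period after which a
    node that is still outside a primary partition is reported as failed.
    """
    cluster = status.get("wsrep_cluster_status", "")
    state = status.get("wsrep_local_state_comment", "")
    if cluster == "Primary" and state == "Synced":
        return "ready"
    if cluster == "Primary":
        # Part of a primary partition, but state transfer is not finished.
        return "joining"
    if elapsed_s >= join_grace_s:
        # Still no primary partition after the grace period: give up early
        # instead of waiting for the full startup probe timeout.
        return "failed"
    return "joining"
```

For example, a node that still reports `non-Primary` after the grace period is classified as `failed`, so the caller can stop waiting and log the reason immediately.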
A subsequent commit will provide the ability to override probe settings and timeouts.
Jira: OSPRH-11392