Improve tracking and error reporting of startup probe #298

Open
wants to merge 1 commit into base: main


dciabrin
Contributor

@dciabrin dciabrin commented Dec 20, 2024

Currently, startup probes are scheduled with the k8s defaults (run every 10s, failure threshold of 3). As galera joiner nodes can take a long time to start, this generates unnecessary unhealthy events.

Rework how the startup probe works by allowing a single, long probe which internally loops while probing the startup state. Throughout the startup process, keep track of the specific startup phase so that, if startup times out, the probe can log a precise error.
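
To illustrate the idea of a single long probe that loops internally and remembers the last startup phase, here is a minimal Python sketch. It is not the code from this PR: the function name `run_startup_probe`, the `check_phase` callback, and the timeout and interval parameters are all hypothetical, chosen only to show the control flow.

```python
import time
from typing import Callable, Tuple


def run_startup_probe(
    check_phase: Callable[[], Tuple[str, bool]],
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Poll startup state inside one probe invocation (illustrative only).

    check_phase is a hypothetical callback returning (phase_name, done).
    Returns (ok, message); on timeout the message names the last phase seen.
    """
    deadline = clock() + timeout_s
    while True:
        phase, done = check_phase()
        if done:
            return True, f"startup complete (last phase: {phase})"
        if clock() >= deadline:
            # Report the phase we were stuck in, rather than a generic failure.
            return False, f"startup timed out after {timeout_s}s during phase: {phase}"
        sleep(interval_s)
```

Because the loop runs inside one probe execution, only a final timeout is surfaced as a failure, instead of one failure per short probe attempt.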

Also rework how joiner nodes are tracked, to fail early in case galera cannot join a primary partition, so the server does not stay stuck until the startup probe times out.
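
As a rough illustration of failing early for a joiner, the Python sketch below classifies a node from two Galera status variables, `wsrep_cluster_status` and `wsrep_local_state_comment`. How this PR actually tracks joiners is not shown here, so the function `assess_joiner`, the `JoinerVerdict` enum, and the grace period are assumptions made for the example.

```python
from enum import Enum
from typing import Dict, Tuple


class JoinerVerdict(Enum):
    READY = "ready"
    WAIT = "wait"
    FAIL = "fail"


def assess_joiner(
    status: Dict[str, str],
    seconds_non_primary: float,
    grace_s: float = 30.0,  # hypothetical grace period, not a value from the PR
) -> Tuple[JoinerVerdict, str]:
    """Decide whether a joiner is ready, still starting, or should fail early."""
    cluster = status.get("wsrep_cluster_status", "unknown")
    state = status.get("wsrep_local_state_comment", "unknown")

    if cluster == "Primary":
        if state == "Synced":
            return JoinerVerdict.READY, "node is synced with the primary partition"
        # Part of a primary partition but still transferring or applying state.
        return JoinerVerdict.WAIT, f"in primary partition, local state: {state}"

    if seconds_non_primary >= grace_s:
        # Give up now instead of waiting for the overall startup timeout.
        return JoinerVerdict.FAIL, f"could not join a primary partition (cluster status: {cluster})"
    return JoinerVerdict.WAIT, f"waiting to join a primary partition (cluster status: {cluster})"
```

The point of the separate `FAIL` verdict is that a node outside any primary partition is reported with a specific reason as soon as the grace period expires.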

A subsequent commit will provide the ability to override probe settings and timeouts.

Jira: OSPRH-11392

@openshift-ci openshift-ci bot requested review from dprince and viroel December 20, 2024 15:33
Contributor

openshift-ci bot commented Dec 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dciabrin
Contributor Author

/retest-required

@dciabrin
Contributor Author

/retest-required
CI job could not build image to run kuttl
