Improve tracking and error reporting of startup probe #298

dciabrin · 2024-12-20T15:33:04Z

Currently, startup probes are scheduled with the k8s defaults (run every 10s, failure threshold of 3). As galera joiner nodes can take a long time to start, this generates unnecessary unhealthy events.

Rework how the startup probe works by allowing a single, long probe which internally loops while probing the startup state. Throughout the startup process, keep track of the specific startup phase so that if the startup times out, the probe can log a precise error.

Also rework how joiner nodes are tracked, to fail early in case galera cannot join a primary partition, to avoid the server being stuck indefinitely until the startup probe times out.
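As a rough illustration of the approach described above (not the code in this PR), here is a minimal Python sketch of a single long-running probe that polls the startup state, remembers the last observed phase, fails early when the node cannot join a primary partition, and reports the phase on timeout. The phase names, the `read_state` callback, and the `StartupFailed` exception are all hypothetical and exist only for this example.

```python
import time
from typing import Callable, Tuple


class StartupFailed(Exception):
    """Raised when the probe gives up (hypothetical, for illustration only)."""


def run_startup_probe(
    read_state: Callable[[], Tuple[str, bool]],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Loop inside one probe invocation until the node reports 'synced'.

    read_state is an assumed callback returning (phase, can_join_primary).
    The phase strings are made up for this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        phase, can_join_primary = read_state()
        if phase == "synced":
            return phase
        if not can_join_primary:
            # Fail early instead of waiting for the whole timeout to elapse.
            raise StartupFailed(
                f"cannot join a primary partition (phase: {phase})"
            )
        if clock() >= deadline:
            # Report the phase in which startup was stuck.
            raise StartupFailed(f"startup timed out during phase: {phase}")
        sleep(poll_s)
```

Because the loop lives inside one probe run, the kubelet would see a single pass or fail result rather than a series of short failures, which is the effect the description aims for. The `clock` and `sleep` parameters are only there so the loop can be exercised without real waiting.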

A subsequent commit will provide the ability to override probe settings and timeouts.

Jira: OSPRH-11392

openshift-ci · 2024-12-20T15:33:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dciabrin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dciabrin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

dciabrin · 2024-12-23T09:43:55Z

/retest-required

dciabrin · 2024-12-23T10:46:41Z

/retest-required
CI job could not build image to run kuttl


openshift-ci bot requested review from dprince and viroel December 20, 2024 15:33

openshift-ci bot added the approved label Dec 20, 2024

dciabrin force-pushed the startup-probe branch from b844f44 to 340ee75 December 23, 2024 12:44
