[PERFSCALE-3428] Cluster health-check #727

Open
rsevilla87 wants to merge 1 commit into master
Conversation

rsevilla87 (Member) commented Oct 9, 2024

Type of change

  • Refactor
  • New feature
  • Bug fix
  • Optimization
  • Documentation Update

Description

This PR performs a cluster-operator health check before running the actual workload. I considered adding this logic as a step in Prow, but I ended up adding it here for several reasons:

  • Too many jobs: a new chain would have to be added to every job.
  • More maintenance: maintaining steps in Prow is more difficult than maintaining them here.

Please leave your thoughts here.
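For illustration, a minimal sketch of the kind of pre-workload gate this adds (the error handling around the command is hypothetical; the PR itself adds only the `oc adm wait-for-stable-cluster` call shown in the diff below):

```bash
#!/bin/bash -e
# Sketch: abort the benchmark early if cluster operators are not stable.
# The stability windows here are illustrative, not this PR's final values.
if ! oc adm wait-for-stable-cluster --minimum-stable-period=1m --timeout=5m; then
  echo "Cluster operators did not stabilize; aborting workload"
  oc get clusteroperators  # surface which operators are unhealthy
  exit 1
fi
```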

Related Tickets & Documents

  • Related Issue #
  • Closes #


openshift-ci bot commented Oct 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


openshift-ci bot commented Oct 9, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rsevilla87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


```bash
source ./env.sh
source ../../utils/common.sh
```
rsevilla87 (Member, Author) commented:

Let's consolidate on `echo` rather than `log`.
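For context (a hypothetical example; the actual helper names in `utils/common.sh` may differ), the consolidation amounts to preferring plain `echo` over a custom logging helper:

```bash
# Before (assumed log helper sourced from utils/common.sh):
log "Waiting for cluster operators to stabilize"
# After: plain echo, no helper dependency:
echo "Waiting for cluster operators to stabilize"
```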

Signed-off-by: Raul Sevilla <[email protected]>

```diff
@@ -1,6 +1,7 @@
 #!/bin/bash -e
 
 set -e
+oc adm wait-for-stable-cluster --minimum-stable-period=1m --timeout=5m
```
Collaborator commented:

Just curious whether we need this here, since we already have a cluster health check enabled by default in the ocp-wrapper driver code.

rsevilla87 (Member, Author) replied:

Yeah, it's probably redundant; let me think of some use cases where the cluster health check is not performed...

```diff
@@ -1,6 +1,7 @@
 #!/bin/bash -e
 
 set -e
+oc adm wait-for-stable-cluster --minimum-stable-period=1m --timeout=5m
```
Member commented:

5m might be too short. For example, if we want to run kube-burner after an upgrade, I'm seeing some pods (kube-apiserver) take more than 5 minutes just to roll out.

rsevilla87 (Member, Author) replied:

The control-plane pods are supposed to be rolled out by the time the upgrade finishes; another potential issue is the MCP rollout, which runs asynchronously to the `oc adm upgrade` command (a wait for that case is sketched below).

Also, I don't think we run any workload right after an upgrade completes.
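If the MCP rollout ever needs to be covered explicitly, a hypothetical addition (not part of this PR; the timeout value is an assumption) could wait for all MachineConfigPools to finish updating before the stability check:

```bash
# Hypothetical: block until every MachineConfigPool reports Updated=True.
# 30m is an illustrative timeout, not a recommendation.
oc wait machineconfigpool --all --for=condition=Updated=True --timeout=30m
```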

Member replied:

I tried setting --minimum-stable-period to 1m and saw kube-burner errors because the cluster was still stabilizing, but keeping it at 5m has given me better results. Just my current experience...
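One way to reconcile these experiences (a hypothetical sketch, not part of this PR; the variable names and defaults are assumptions) is to make both windows overridable per job via environment variables:

```bash
# Hypothetical: let each job tune the stability windows; defaults are illustrative.
MINIMUM_STABLE_PERIOD=${MINIMUM_STABLE_PERIOD:-5m}
STABILITY_TIMEOUT=${STABILITY_TIMEOUT:-30m}
oc adm wait-for-stable-cluster \
  --minimum-stable-period="${MINIMUM_STABLE_PERIOD}" \
  --timeout="${STABILITY_TIMEOUT}"
```

An upgrade-heavy job could then export a longer timeout without touching the script.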
