

@jupierce

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.
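
To illustrate the kind of change being described, here is a hypothetical sketch in Go (the package, constant names, values, and helper below are assumptions for illustration only, not the actual carry patch):

```go
// Hypothetical sketch only: an e2e wait helper whose timeout is extended to
// tolerate temporary controller load during parallel testing.
package e2etimeouts

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	// Assumed prior value: enough when the suite ran with little contention.
	oldControllerSyncTimeout = 1 * time.Minute
	// Assumed extended value: gives controllers headroom under parallel load.
	newControllerSyncTimeout = 3 * time.Minute
)

// waitForCondition polls until cond reports done, using the extended timeout
// so a temporarily busy controller does not turn a healthy test into a flake.
func waitForCondition(ctx context.Context, cond wait.ConditionWithContextFunc) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, newControllerSyncTimeout, true, cond)
}
```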

@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Oct 24, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot requested review from jerpeter1 and tkashem October 24, 2025 23:56
@openshift-ci

openshift-ci bot commented Oct 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jupierce
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@jupierce jupierce changed the title UPSTREAM: <carry>: extend k8s suite timeouts for parallel testing load Extend k8s suite timeouts for parallel testing load Oct 25, 2025
@jupierce jupierce changed the title Extend k8s suite timeouts for parallel testing load NO-ISSSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@jupierce jupierce changed the title NO-ISSSUE: Extend k8s suite timeouts for parallel testing load NO-ISSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 27, 2025
@openshift-ci-robot

@jupierce: This pull request explicitly references no jira issue.

In response to this:

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jupierce
Author

/verified by e2e testing

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: This PR has been marked as verified by e2e testing.

In response to this:

/verified by e2e testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Analysis of flakes from the k8s suite has shown consistent examples
of otherwise well-behaved tests failing due to timeouts caused by
temporary load on controllers during parallel testing. Increasing
these timeouts will reduce flakes.
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci

openshift-ci bot commented Oct 27, 2025

@jupierce: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                              | Commit  | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-serial             | 59a2d7e | link    | true     | /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-techpreview-serial | 59a2d7e | link    | false    | /test e2e-aws-ovn-techpreview-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@bertinatto bertinatto left a comment


Could we try to get this upstream instead? If we have data about these timeouts being too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

@benluddy @jacobsee WDYT?

@benluddy

benluddy commented Nov 6, 2025

Could we try to get this upstream instead? If we have data about these timeouts being too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

Are we sure that the controller load mentioned here is necessary to the tests? kubernetes#131518 comes to mind as an example where a group of E2E tests were failing not because they were invalid tests but because they were generating an unnecessary amount of reconciliation work for controllers, and the controllers could not catch up before the tests timed out.

@jupierce
Author

jupierce commented Nov 6, 2025

As to whether this can go upstream, I don't have a strong opinion. The k8s-suite tests are run in parallel with non-k8s-suite tests from origin. Our load is therefore unique and upstream may not face the same early timeout issues. We also test on a minimum supported CPU configuration. If upstream is testing with more CPU, they may not see it.

@benluddy the analysis uses statistical tools across the huge volume of data we collect in our CI runs. It outputs insights like "Test X fails 6x more often when run at the same time as test Y". The pattern I've seen for these timeout issues is that test X fails 3x-6x more often when run in parallel with any of a dozen other tests (since our testing is randomized, test X can get paired with different tests across CI runs). For the cases I've investigated at the code level, test X is just waiting (i.e. it is not misbehaving). So we can conclude that (a) the other dozen tests are misbehaving, (b) a given controller is not performant, or (c) we are asking too much of the system in the time permitted. (a) and (b) are unlikely or extremely expensive to fix relative to the reward here, so I'm advocating for (c).

In cases where test X fails significantly more often when run at the same time as test Y, but there are only one or two such Y's, I'm treating those as test bugs to pursue.
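
As a rough illustration of the pairwise analysis described above (a hypothetical sketch in Go; the data model, field names, and function are assumptions, not the actual statistical tooling), the core signal is the ratio of test X's failure rate when co-scheduled with test Y to its overall failure rate:

```go
// Hypothetical sketch only: not the real analysis pipeline.
package flakepairs

// RunResult records, for a single CI run, whether test X failed and which
// other tests were scheduled at the same time as X.
type RunResult struct {
	Failed     bool
	Concurrent map[string]bool
}

// FailureRatio reports how much more often test X fails when co-scheduled
// with testY, relative to X's overall failure rate. A result of 6 matches
// the "fails 6x more often when run at the same time as test Y" insight.
func FailureRatio(runs []RunResult, testY string) float64 {
	var total, failed, withY, failedWithY float64
	for _, r := range runs {
		total++
		if r.Failed {
			failed++
		}
		if r.Concurrent[testY] {
			withY++
			if r.Failed {
				failedWithY++
			}
		}
	}
	if failed == 0 || withY == 0 {
		return 0 // not enough data to compare
	}
	return (failedWithY / withY) / (failed / total)
}
```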
