

@jupierce

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.
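
To illustrate the kind of change being described, here is a hypothetical sketch in Go (the package, constant names, values, and helper below are assumptions for illustration only, not the actual carry patch):

```go
// Hypothetical sketch only: an e2e wait helper whose timeout is extended to
// tolerate temporary controller load during parallel testing.
package e2etimeouts

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

const (
	// Assumed prior value: enough when the suite ran with little contention.
	oldControllerSyncTimeout = 1 * time.Minute
	// Assumed extended value: gives controllers headroom under parallel load.
	newControllerSyncTimeout = 3 * time.Minute
)

// waitForCondition polls until cond reports done, using the extended timeout
// so a temporarily busy controller does not turn a healthy test into a flake.
func waitForCondition(ctx context.Context, cond wait.ConditionWithContextFunc) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, newControllerSyncTimeout, true, cond)
}
```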

@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Oct 24, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci openshift-ci bot requested review from jerpeter1 and tkashem October 24, 2025 23:56
@openshift-ci

openshift-ci bot commented Oct 24, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jupierce
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@jupierce jupierce changed the title UPSTREAM: <carry>: extend k8s suite timeouts for parallel testing load Extend k8s suite timeouts for parallel testing load Oct 25, 2025
@jupierce jupierce changed the title Extend k8s suite timeouts for parallel testing load NO-ISSSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@jupierce jupierce changed the title NO-ISSSUE: Extend k8s suite timeouts for parallel testing load NO-ISSUE: Extend k8s suite timeouts for parallel testing load Oct 27, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 27, 2025
@openshift-ci-robot

@jupierce: This pull request explicitly references no jira issue.

In response to this:

Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jupierce
Author

/verified by e2e testing

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: This PR has been marked as verified by e2e testing.

In response to this:

/verified by e2e testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Analysis of flakes from the k8s suite has shown consistent examples
of otherwise well-behaved tests failing due to timeouts caused by
temporary load on controllers during parallel testing. Increasing
these timeouts will reduce flakes.
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Oct 27, 2025
@openshift-ci-robot

@jupierce: the contents of this pull request could not be automatically validated.

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@openshift-ci

openshift-ci bot commented Oct 27, 2025

@jupierce: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                              | Commit  | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-serial             | 59a2d7e | link    | true     | /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-techpreview-serial | 59a2d7e | link    | false    | /test e2e-aws-ovn-techpreview-serial

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@bertinatto bertinatto left a comment


Could we try to get this upstream instead? If we have data about these timeouts being too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

@benluddy @jacobsee WDYT?

@benluddy

benluddy commented Nov 6, 2025

Could we try to get this upstream instead? If we have data about these timeouts being too short under high load, I think we can make a case for bumping these values.

Edit: and we would avoid carrying a commit.

Are we sure that the controller load mentioned here is necessary to the tests? kubernetes#131518 comes to mind as an example where a group of E2E tests were failing not because they were invalid tests but because they were generating an unnecessary amount of reconciliation work for controllers, and the controllers could not catch up before the tests timed out.

@jupierce
Author

jupierce commented Nov 6, 2025

As to whether this can go upstream, I don't have a strong opinion. The k8s-suite tests are run in parallel with non-k8s-suite tests from origin. Our load is therefore unique and upstream may not face the same early timeout issues. We also test on a minimum supported CPU configuration. If upstream is testing with more CPU, they may not see it.

@benluddy the analysis uses statistical tools across the huge volume of data we collect in our CI runs. It outputs insights like "Test X fails 6x more often when run at the same time as test Y". The pattern I've seen for these timeout issues is that test X fails 3x-6x more often when run in parallel with any of a dozen other tests (since our testing is randomized, test X can get paired with different tests across CI runs). For the cases I've investigated at the code level, test X is just waiting (i.e. it is not misbehaving). So we can conclude that (a) the other dozen tests are misbehaving, (b) a given controller is not performant, or (c) we are asking too much of the system in the time permitted. (a) and (b) are unlikely or extremely expensive to fix relative to the reward here, so I'm advocating for (c).

In cases where test X fails significantly more often when run at the same time as test Y, but there are only one or two such Y's, I'm treating those as test bugs to pursue.
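
As a rough illustration of the pairwise analysis described above (a hypothetical sketch in Go; the data model, field names, and function are assumptions, not the actual statistical tooling), the core signal is the ratio of test X's failure rate when co-scheduled with test Y to its overall failure rate:

```go
// Hypothetical sketch only: not the real analysis pipeline.
package flakepairs

// RunResult records, for a single CI run, whether test X failed and which
// other tests were scheduled at the same time as X.
type RunResult struct {
	Failed     bool
	Concurrent map[string]bool
}

// FailureRatio reports how much more often test X fails when co-scheduled
// with testY, relative to X's overall failure rate. A result of 6 matches
// the "fails 6x more often when run at the same time as test Y" insight.
func FailureRatio(runs []RunResult, testY string) float64 {
	var total, failed, withY, failedWithY float64
	for _, r := range runs {
		total++
		if r.Failed {
			failed++
		}
		if r.Concurrent[testY] {
			withY++
			if r.Failed {
				failedWithY++
			}
		}
	}
	if failed == 0 || withY == 0 {
		return 0 // not enough data to compare
	}
	return (failedWithY / withY) / (failed / total)
}
```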
