NO-ISSUE: Extend k8s suite timeouts for parallel testing load #2497
Conversation
@jupierce: the contents of this pull request could not be automatically validated. The following commits could not be validated and must be approved by a top-level approver:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jupierce. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
jupierce force-pushed from 4dcde0e to 723f884
@jupierce: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/verified by e2e testing
@jupierce: This PR has been marked as verified by e2e testing. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Analysis of flakes from the k8s suite has shown consistent examples of otherwise well-behaved tests failing due to timeouts caused by temporary load on controllers during parallel testing. Increasing these timeouts will reduce flakes.
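To illustrate the failure mode the description refers to, here is a minimal sketch (Python, with hypothetical helper names — the actual suite code is not shown in this PR): a test polls for a condition that the controller does eventually satisfy, but a tight deadline fails the run whenever the controller is briefly busy, while an extended deadline passes without changing the test logic.

```python
import time

def wait_for(condition, timeout, interval=0.01):
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def make_slow_controller(ready_after):
    """Simulate a controller that reconciles only after a load-induced delay."""
    start = time.monotonic()
    return lambda: time.monotonic() - start >= ready_after

# Controller needs 0.2s under parallel load: a 0.1s timeout flakes,
# while an extended 0.5s timeout passes. The test itself is unchanged.
assert wait_for(make_slow_controller(0.2), timeout=0.1) is False
assert wait_for(make_slow_controller(0.2), timeout=0.5) is True
```

The point is that the test is not misbehaving — it is only waiting — so widening the deadline is the minimal fix.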
jupierce force-pushed from 723f884 to 59a2d7e
@jupierce: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Are we sure that the controller load mentioned here is necessary to the tests? kubernetes#131518 comes to mind as an example where a group of E2E tests was failing not because the tests were invalid, but because they were generating an unnecessary amount of reconciliation work for controllers, and the controllers could not catch up before the tests timed out.
As to whether this can go upstream, I don't have a strong opinion. The k8s-suite tests are run in parallel with non-k8s-suite tests from origin, so our load is unique and upstream may not face the same early-timeout issues. We also test on a minimum supported CPU configuration; if upstream is testing with more CPU, they may not see it.

@benluddy the analysis uses statistical tools across the huge volume of data we collect in our CI runs. It outputs insights like "Test X fails 6x more often when run at the same time as test Y". The pattern I've seen for these timeout issues is that test X fails 3x-6x more often when run in parallel with any of a dozen other tests (since our testing is randomized, test X can get paired with different tests across CI runs). For the cases I've investigated at the code level, test X is just waiting (i.e. it is not misbehaving). So we can conclude that (a) the other dozen tests are misbehaving, (b) a given controller is not performant, or (c) we are asking too much of the system in the time permitted. (a) and (b) are unlikely or extremely expensive to fix relative to the reward here, so I'm advocating for (c).

In cases where test X fails significantly more often when run at the same time as test Y, but there are only 1 or 2 such Y's, I'm treating those as a test bug to pursue.
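The co-occurrence analysis described above could be sketched like this (Python; the data shape, function name, and 3x threshold are illustrative assumptions, not the actual tooling): for a given test X, compare its failure rate when co-scheduled with each partner test Y against its overall failure rate, and flag partners whose ratio clears the threshold.

```python
def flake_ratios(runs, test, min_ratio=3.0):
    """For each test Y co-scheduled with `test` (our test X), compare X's
    failure rate when Y is present against X's overall failure rate.

    `runs` is a list of dicts mapping test name -> True (passed) / False (failed).
    Returns {Y: ratio} for partners whose ratio meets `min_ratio`.
    """
    x_runs = [r for r in runs if test in r]
    x_fails = sum(1 for r in x_runs if not r[test])
    if not x_fails:
        return {}
    base_rate = x_fails / len(x_runs)

    suspects = {}
    partners = {y for r in x_runs for y in r if y != test}
    for y in partners:
        with_y = [r for r in x_runs if y in r]
        rate = sum(1 for r in with_y if not r[test]) / len(with_y)
        if rate / base_rate >= min_ratio:
            suspects[y] = rate / base_rate
    return suspects

# Toy data: X fails far more often when co-scheduled with Y than with Z.
runs = (
    [{"X": False, "Y": True}] * 2   # X fails in 2 of 3 runs alongside Y
    + [{"X": True, "Y": True}]
    + [{"X": True, "Z": True}] * 9  # X never fails alongside Z
)
print(flake_ratios(runs, "X"))  # Y's ratio is (2/3) / (2/12) = 4x the baseline
```

A single suspect Y (as in this toy data) would point at a test bug to pursue; a dozen suspects for the same X is the "system under load" signature that motivates raising the timeout instead.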