This repository has been archived by the owner on Aug 19, 2024. It is now read-only.

ci: Run E2E tests against all versions of K8s we claim support for #397

Conversation

@rm3l (Member) commented Jul 5, 2024

Description

To help us detect potential issues in advance, this expands the E2E configuration matrix to test against several versions of K8s.
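
For illustration only, an expanded matrix of this kind might look roughly like the sketch below; the version list, kind node-image tags, and make target are placeholder assumptions, not the actual values from this PR.

```yaml
# Hypothetical sketch of a nightly E2E job fanning out over several K8s versions.
name: Nightly E2E matrix (illustrative sketch)
on:
  schedule:
    - cron: "0 2 * * *"       # nightly
jobs:
  e2e-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false        # keep testing the other K8s versions even if one fails
      matrix:
        kind_node_image:
          # versions we claim support for (placeholders)...
          - kindest/node:v1.26.15
          - kindest/node:v1.27.13
          - kindest/node:v1.28.9
          # ...plus a newer one for ahead-of-time testing
          - kindest/node:v1.29.4
    steps:
      - uses: actions/checkout@v4
      - name: Create a kind cluster running the selected K8s version
        uses: helm/kind-action@v1
        with:
          node_image: ${{ matrix.kind_node_image }}
      - name: Run the E2E suite
        run: make test-e2e    # assuming the repo's existing E2E make target
```

With fail-fast disabled, one failing K8s version does not cancel the other jobs, so each configuration reports its own status.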

Which issue(s) does this PR fix or relate to

PR acceptance criteria

  • Tests
  • Documentation
  • If the bundle manifests have been updated, make sure to review the rhdh-operator.csv.yaml file accordingly

How to test changes / Special notes to the reviewer

Example of a run here: https://github.com/rm3l/janus-idp-operator/actions/runs/9809691738

The failures are not related to the changes here, but to an issue with the default janus-idp/backstage-showcase image's latest tag being outdated, while the recently-merged #390 only works with 1.2+. See https://redhat-internal.slack.com/archives/C04CUSD4JSG/p1720191475569659

@gazarenkov (Member) left a comment

There are a couple of issues here, to me:

  • It is not clear what issue we are solving here, so the "status" of this testing is not clear either. It refers to OCP versions, which are clearly based on certain K8s versions, but the tests run against K8s (sure), so they cannot be taken as OCP compatibility tests anyway.
  • It definitely brings some useful compatibility information for a K8s X.Y version, while the Z version can basically be "any" (no breaking changes expected). Taking into account that K8s releases every 3-4 months, making it nightly sounds like overkill to me.
  • In my opinion, having them run nightly just overloads the Actions output (and compute resources :) ) with mostly the same information.

So, I would say it is a useful test, but we probably need a clearer definition of it. For example: as upstream pre-release tests that make sure we support the supportable K8s versions, run as part of the pre-release procedure rather than nightly.

@rm3l (Member, Author) commented Jul 10, 2024

> There are a couple of issues here, to me:

> • It is not clear what issue we are solving here, so the "status" of this testing is not clear either. It refers to OCP versions, which are clearly based on certain K8s versions, but the tests run against K8s (sure), so they cannot be taken as OCP compatibility tests anyway.

I would ideally run those tests directly on OCP if it were as easy as a GitHub Action, but I couldn't find a simple way to spin up a minimal OCP cluster in a GitHub runner. So this uses the OCP versions we claim support for as baselines to determine the versions of K8s we could/should test against.
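
For context, the baselines rely on the Kubernetes version shipped with each OCP minor release; the snippet below is an illustrative sketch of that mapping, not data taken from this PR.

```yaml
# Illustrative mapping only: the Kubernetes minor version shipped with each
# OpenShift (OCP) release, usable as a baseline when picking the K8s versions
# to run the E2E tests against.
ocp_to_k8s:
  "4.13": "1.26"
  "4.14": "1.27"
  "4.15": "1.28"
  "4.16": "1.29"
```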

> • It definitely brings some useful compatibility information for a K8s X.Y version, while the Z version can basically be "any" (no breaking changes expected). Taking into account that K8s releases every 3-4 months, making it nightly sounds like overkill to me.

The variable here is the operator code (which can change frequently), which I think we should have real E2E tests for, run against the versions of K8s that we should support (a list that won't change frequently either).
There is also one other useful part of this Workflow, IMO: ahead-of-time testing against versions of K8s that we do not support yet will help us catch potential issues in advance.

> • In my opinion, having them run nightly just overloads the Actions output (and compute resources :) ) with mostly the same information.

The Actions output is separated per job, so per configuration in the matrix. So it won't change that much from the existing nightly workflow; it is just that we have new jobs from the matrix to look at.
In terms of resources, all matrix configuration jobs run in parallel, so there is no impact on the overall duration of the whole Workflow. The current nightly workflow on main takes ~19m to run, and the Workflow from this PR took almost the same amount of time.

> So, I would say it is a useful test, but we probably need a clearer definition of it. For example: as upstream pre-release tests that make sure we support the supportable K8s versions, run as part of the pre-release procedure rather than nightly.

The goal of the nightly Workflow is to prevent a last-minute rush before a release, when you notice that something does not work end-to-end, by regularly running a comprehensive suite of tests against the released image. This way, based on the status of your last nightly job, you can even release any good commit at any time.

Are you concerned about having to check the status too frequently?

@coreydaley (Member) commented

You can set up tests on OpenShift CI to run against OpenShift clusters. There are lots of examples of how to do this in the github.com/openshift org; specifically, look at the https://github.com/openshift/release repository.
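
For reference, a Prow/ci-operator test entry in openshift/release looks roughly like the sketch below; the file path, test names, and make target are assumptions for illustration, not an actual config for this repository.

```yaml
# Illustrative ci-operator config sketch, e.g. under
# ci-operator/config/<org>/<repo>/ in the openshift/release repository.
tests:
- as: e2e-ocp
  steps:
    cluster_profile: aws        # OpenShift CI provisions a real, throwaway OCP cluster
    test:
    - as: e2e
      commands: make test-e2e   # assuming the repo's E2E make target
      from: src
      resources:
        requests:
          cpu: 100m
    workflow: ipi-aws           # standard install/teardown workflow from the step registry
```

Such a setup would run the suite against an actual OCP cluster per job, which a plain GitHub runner cannot easily provide.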

@rm3l (Member, Author) commented Jul 10, 2024

> You can set up tests on OpenShift CI to run against OpenShift clusters. There are lots of examples of how to do this in the github.com/openshift org; specifically, look at the openshift/release repository.

Yeah, I know, and that would be a potential next step (if we agree on the approach here). For now, this tries to leverage the existing GH Workflow we have for testing against vanilla K8s.

@gazarenkov (Member) commented

> > There are a couple of issues here, to me:

> > • It is not clear what issue we are solving here, so the "status" of this testing is not clear either. It refers to OCP versions, which are clearly based on certain K8s versions, but the tests run against K8s (sure), so they cannot be taken as OCP compatibility tests anyway.

> I would ideally run those tests directly on OCP if it were as easy as a GitHub Action, but I couldn't find a simple way to spin up a minimal OCP cluster in a GitHub runner. So this uses the OCP versions we claim support for as baselines to determine the versions of K8s we could/should test against.

> > • It definitely brings some useful compatibility information for a K8s X.Y version, while the Z version can basically be "any" (no breaking changes expected). Taking into account that K8s releases every 3-4 months, making it nightly sounds like overkill to me.

> The variable here is the operator code (which can change frequently), which I think we should have real E2E tests for, run against the versions of K8s that we should support (a list that won't change frequently either). There is also one other useful part of this Workflow, IMO: ahead-of-time testing against versions of K8s that we do not support yet will help us catch potential issues in advance.

> > • In my opinion, having them run nightly just overloads the Actions output (and compute resources :) ) with mostly the same information.

> The Actions output is separated per job, so per configuration in the matrix. So it won't change that much from the existing nightly workflow; it is just that we have new jobs from the matrix to look at. In terms of resources, all matrix configuration jobs run in parallel, so there is no impact on the overall duration of the whole Workflow. The current nightly workflow on main takes ~19m to run, and the Workflow from this PR took almost the same amount of time.

> > So, I would say it is a useful test, but we probably need a clearer definition of it. For example: as upstream pre-release tests that make sure we support the supportable K8s versions, run as part of the pre-release procedure rather than nightly.

> The goal of the nightly Workflow is to prevent a last-minute rush before a release, when you notice that something does not work end-to-end, by regularly running a comprehensive suite of tests against the released image. This way, based on the status of your last nightly job, you can even release any good commit at any time.

> Are you concerned about having to check the status too frequently?

Like I said, the compatibility tests (K8s and/or OCP, if possible) are useful.

At the same time, in the absolute majority of cases they do not bring any useful information daily, so they will very quickly be perceived as "white noise". As a related example, you can see the current state of the nightly tests: they failed in many cases for known reasons, but we do not care about that because they are not that important, because pretty much everything is covered in the other (PR) flows. On the other hand, from the outside, it looks like a broken environment. That's why I really doubt the usefulness of daily tests in our context(!) at all.

So, I see such tests as part of a (conscious) [pre/post]-release process/recommendation rather than a periodic job.

I do not really understand your "last-minute rush" concern; there is time to focus on development and time to focus on the release (a "code freeze" state, in our case). Could you please be more specific about the cases you have in mind?

@rm3l (Member, Author) commented Jul 11, 2024

> Like I said, the compatibility tests (K8s and/or OCP, if possible) are useful.

> At the same time, in the absolute majority of cases they do not bring any useful information daily, so they will very quickly be perceived as "white noise". As a related example, you can see the current state of the nightly tests: they failed in many cases for known reasons, but we do not care about that because they are not that important, because pretty much everything is covered in the other (PR) flows. On the other hand, from the outside, it looks like a broken environment. That's why I really doubt the usefulness of daily tests in our context(!) at all.

PR tests currently do not run against a real cluster to make sure that everything works end-to-end. So I am not confident that a merged PR with the current tests passing guarantees that the deployed application is actually reachable (which is what matters to the end user, right?).
Take #390 for example: all PR tests passed, but the nightly job that ran against it allowed us to detect, the day after, an issue with the latest tag of the Showcase application (because the deployed pod ended up crashing on a real cluster).
I would rather not accumulate a lot of changes and then run a manual E2E test only a few days before a release.

> So, I see such tests as part of a (conscious) [pre/post]-release process/recommendation rather than a periodic job.

Is there such a process in place? A recommendation is only useful if there is a way to enforce it, and a nightly job is a fairly cheap and automated way to enforce that. ;)

> I do not really understand your "last-minute rush" concern; there is time to focus on development and time to focus on the release (a "code freeze" state, in our case). Could you please be more specific about the cases you have in mind?

If we want to be able to release at any time, for example, nightly tests will help.

@rm3l rm3l marked this pull request as draft July 12, 2024 10:13
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jul 12, 2024
@rm3l rm3l marked this pull request as ready for review July 12, 2024 10:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Jul 12, 2024
@openshift-ci openshift-ci bot requested a review from gazarenkov July 12, 2024 10:13
@gazarenkov (Member) left a comment

I see this compatibility testing as good to have, for sure, but:

The state of the current nightly tests in general is not satisfactory. They have not been stable for quite a long time and so look broken.
#399
That's why extending them with any new functionality is clearly a no-go from my side.

@rm3l The point about not having "real cluster tests" on CI (except the ones from nightly) is valid, and we have to solve it ASAP.
After that, I do not see any reason to keep the nightly jobs until we are sure we are able to properly maintain them and see a real need for them.

openshift-ci bot commented Jul 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from gazarenkov. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rm3l (Member, Author) commented Jul 12, 2024

> I see this compatibility testing as good to have, for sure, but:

> The state of the current nightly tests in general is not satisfactory. They have not been stable for quite a long time and so look broken. #399 That's why extending them with any new functionality is clearly a no-go from my side.

I already stated that they have not been passing (only since the past few days) because of #390, which was recently merged. The issue (thankfully caught quickly by the nightly checks) is being tracked in this JIRA: https://issues.redhat.com/browse/RHIDP-3157
Before that, they had been quite stable. From time to time there were some infra issues due to the runner itself, but nothing critical per se.

Anyway, let's bring this discussion to one of our WG calls, to get more opinions on how to approach this.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold label Jul 12, 2024
@gazarenkov (Member) commented Jul 15, 2024

> > I see this compatibility testing as good to have, for sure, but:
> > The state of the current nightly tests in general is not satisfactory. They have not been stable for quite a long time and so look broken. #399 That's why extending them with any new functionality is clearly a no-go from my side.

> I already stated that they have not been passing (only since the past few days) because of #390, which was recently merged. The issue (thankfully caught quickly by the nightly checks) is being tracked in this JIRA: https://issues.redhat.com/browse/RHIDP-3157 Before that, they had been quite stable. From time to time there were some infra issues due to the runner itself, but nothing critical per se.

It looks like we are talking about different issues.

Take a look here:

https://github.com/janus-idp/operator/actions?page=2&query=event%3Aschedule

And you can see that the majority of the nightly builds have failed.
Here's the issue I created for it: #399. Please take a look.
Do you mean something else when you say it is not a problem?

> Anyway, let's bring this discussion to one of our WG calls, to get more opinions on how to approach this.

> /hold

@openshift-merge-robot commented

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rm3l rm3l closed this Jul 22, 2024
Labels: do-not-merge/hold, needs-rebase
4 participants