chore: pipeline health checks during rollout #795

jon-funk · 2024-12-04T23:21:36Z

Description

This new script and action run in tandem with the helm build to QA it at the infra level, highlighting issues that helm may not report back about. The script is also configured to time out just before helm does and to collect information when a helm build fails or times out, so you get a report in the PR on what might be wrong.
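To make the timing idea concrete, here is a minimal sketch of a watcher that gives itself a slightly smaller time budget than helm and prints warning events if the rollout does not finish. It is not the script from this PR: the variable names, the default values, the function names, and the choice of `kubectl` (rather than `oc`) are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: watch a rollout alongside helm and report before helm times out.
# Every name and default below is an assumption, not taken from the PR's script.

HELM_TIMEOUT_SECONDS="${HELM_TIMEOUT_SECONDS:-600}"   # assumed helm --timeout value
SAFETY_MARGIN_SECONDS="${SAFETY_MARGIN_SECONDS:-30}"  # stop this long before helm gives up
POLL_INTERVAL_SECONDS="${POLL_INTERVAL_SECONDS:-10}"  # pause between rollout checks

# Print how many seconds the health check may run (never negative).
healthcheck_budget() {
  local helm_timeout="$1" margin="$2"
  echo $(( helm_timeout > margin ? helm_timeout - margin : 0 ))
}

# Read "TYPE REASON MESSAGE..." event lines on stdin and keep only warnings.
filter_warning_events() {
  awk '$1 == "Warning"'
}

# Poll a deployment until it has rolled out or the budget is spent.
# On timeout, print warning events so the cause shows up in the job log.
watch_rollout() {
  local namespace="$1" deployment="$2"
  local budget deadline
  budget="$(healthcheck_budget "$HELM_TIMEOUT_SECONDS" "$SAFETY_MARGIN_SECONDS")"
  deadline=$(( SECONDS + budget ))

  while [ "$SECONDS" -lt "$deadline" ]; do
    if kubectl rollout status "deployment/${deployment}" -n "$namespace" \
         --timeout=5s >/dev/null 2>&1; then
      echo "rollout of ${deployment} is healthy"
      return 0
    fi
    sleep "$POLL_INTERVAL_SECONDS"
  done

  echo "rollout of ${deployment} did not complete within ${budget}s; warning events:"
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp \
    -o custom-columns=TYPE:.type,REASON:.reason,MESSAGE:.message --no-headers \
    | filter_warning_events
  return 1
}
```

A pipeline step could then call something like `watch_rollout "$NAMESPACE" backend` in parallel with the helm step, where `NAMESPACE` and `backend` are placeholders. Keeping the budget calculation and the event filter as separate functions makes them easy to check without a cluster.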

Example of it catching a probe failure in events: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12170089403/job/33944486709?pr=795

Example of an all green success: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12171219493/job/33947826812

How Has This Been Tested?


Demonstrated success and failure examples

Checklist

I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Further comments

Thanks for the PR!

Deployments, as required, will be available below:

Please create PRs in draft mode. Mark as ready to enable:

Analysis Workflow

After merge, new images are deployed in:

Merge Workflow

mishraomp · 2024-12-05T01:59:31Z

charts/app/templates/backend/templates/deployment.yaml

@@ -22,6 +22,7 @@ spec:
      labels:
        {{- include "backend.labels" . | nindent 8 }}
    spec:
+      minReadySeconds: 10


Do we need to add an extra 10-second delay before the pod is ready to serve incoming traffic?

We've had some api errors from fresh pods starting up in test when under load. I have another PR that lightens up the probes and scaling rates from the HPA - but the init container timing is tricky with probes in general. My thinking here was just a bit of extra security, although now that you mention it 10s may be too much. Thoughts?

I would recommend we fix the readiness probe and, if required, add startup probes. Once the pod is ready, waiting another 10 seconds does not sound ideal to me.

I should mention that at the application level, our health check on the API backend is 'dumb': it just responds to the HTTP request and does not check whether the db is connected and happy. That makes the k8s probe's source of truth unreliable, so I'm compensating for that at the infra level. For now.

removed these - appreciate the feedback @mishraomp


Thanks Jon, I would vote for another ticket in the backlog to fix the readiness and liveness probe in the API :)

.github/scripts/rollout_healthcheck.sh


jon-funk · 2024-12-10T18:17:01Z

Addressed QA comments - do note the new addition of the force pass flag. While we're finding other issues in our infra I wanted to add this so we don't block test/prod with it given the topology of our pipeline.
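As an illustration of how a force pass switch can keep a failing health check from blocking later environments, here is a minimal sketch. The function name `resolve_exit_code` and the variable `FORCE_PASS` are hypothetical and are not taken from the script in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a "force pass" switch; names are assumptions for illustration.

# Map the health check result to the exit code the pipeline step should report.
# $1: health check status (0 = healthy)
# $2: "true" to report success even when the check failed (defaults to "false")
resolve_exit_code() {
  local status="$1" force_pass="${2:-false}"
  if [ "$status" -ne 0 ] && [ "$force_pass" = "true" ]; then
    # Keep the failure visible in the log while letting the pipeline continue.
    echo "health check failed (status ${status}) but force pass is set; not blocking" >&2
    echo 0
  else
    echo "$status"
  fi
}
```

At the end of a script this could be used as `exit "$(resolve_exit_code "$check_status" "$FORCE_PASS")"`, where `check_status` holds the result of the health check. The warning goes to stderr so the failure still appears in the job output even though the step passes.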

sonarqubecloud · 2024-12-10T22:54:21Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

jon-funk added 3 commits December 4, 2024 14:10

first draft of script

384d5da

fix event parsing

205a77e

cleanup and comments

b007707

jon-funk added the "pipeline change" and "not ready" labels Dec 4, 2024

jon-funk added 4 commits December 4, 2024 16:31

tweaks and fix webeoc startup sensitivity

73375e6

ease up on backend readiness

2abe06d

improve event messaging

a3da4f1

make event filter more resistant to transiant events

da45e97

mishraomp reviewed Dec 5, 2024


jon-funk added 3 commits December 5, 2024 11:50

remove minready

883ad76

update help str

6bc02a8

cleanup actions and enable for test/prod

3460023

jon-funk removed the "not ready" label Dec 5, 2024

Merge branch 'release/0.6.9' into CE-1176

788f024

afwilcox requested changes Dec 6, 2024

Three review threads on .github/scripts/rollout_healthcheck.sh (all resolved, two outdated)

jon-funk added 3 commits December 9, 2024 13:33

address change requests

e8b44d7

Merge branch 'CE-1176' of https://github.com/bcgov/nr-compliance-enforcement into CE-1176

085cf5c

Merge branch 'release/0.6.9' into CE-1176

23ec702

jon-funk temporarily deployed to tools December 9, 2024 21:36 — with GitHub Actions Inactive

jon-funk temporarily deployed to tools December 9, 2024 21:37 — with GitHub Actions Inactive

jon-funk added 9 commits December 9, 2024 14:01

fix podlist globbing

74db640

Merge branch 'CE-1176' of https://github.com/bcgov/nr-compliance-enforcement into CE-1176

5a8e0b0

Merge branch 'release/0.6.9' into CE-1176

98c2a51

cleanup variable

2bb8533

consistent auth names

cc54b58

extend metabase readiness probe

d018664

pvc break test

7a3fb7a

revert pvc test

a979d11

add force pass to prevent blocking

296e327

afwilcox added 2 commits December 10, 2024 14:40

Merge branch 'release/0.6.9' into CE-1176

5e1ade8

Merge branch 'release/0.6.9' into CE-1176

6d00b72

afwilcox approved these changes Dec 10, 2024


afwilcox merged commit 8aaeb87 into release/0.6.9 Dec 10, 2024
17 checks passed

afwilcox deleted the CE-1176 branch December 10, 2024 23:52
