Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: pipeline health checks during rollout #795

Merged
merged 25 commits into from
Dec 10, 2024
Merged

Conversation

jon-funk
Copy link
Collaborator

@jon-funk jon-funk commented Dec 4, 2024

Description

This new scritpt+action runs in tandem with the helm build to QA it at the infra level, highlighting issues that helm may not report back about. Additionally, this script is configured to timeout just before helm and collect information in the event of a failing/timing out helm build, so you get a report on what might be wrong in-PR.

Example of it catching a probe failure in events: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12170089403/job/33944486709?pr=795

Example of an all green success: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12171219493/job/33947826812

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Demonstrated success and failure examples

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Further comments


Thanks for the PR!

Deployments, as required, will be available below:

Please create PRs in draft mode. Mark as ready to enable:

After merge, new images are deployed in:


Thanks for the PR!

Deployments, as required, will be available below:

Please create PRs in draft mode. Mark as ready to enable:

After merge, new images are deployed in:

@jon-funk jon-funk added pipeline change Change that updates the pipeline not ready Not ready for review, WIP, do not merge. labels Dec 4, 2024
@@ -22,6 +22,7 @@ spec:
labels:
{{- include "backend.labels" . | nindent 8 }}
spec:
minReadySeconds: 10
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to add extra 10 seconds delay for pod to be ready to serve incoming traffic .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've had some api errors from fresh pods starting up in test when under load. I have another PR that lightens up the probes and scaling rates from the HPA - but the init container timing is tricky with probes in general. My thinking here was just a bit of extra security, although now that you mention it 10s may be too much. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend we fix the readiness probe, if required, we add startup probes, but once pod is ready, waiting for 10 seconds, sounds not ideal to me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should mention that on the application level, our health check on the api backend is 'dumb' in that it just responds to the http request, not actually if the db is connected and happy, so the k8s probe's source of truth is unreliable, so I'm compensating for that on the infra level. For now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these - appreciate the feedback @mishraomp

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed these - appreciate the feedback @mishraomp

Thanks Jon, I would vote for another ticket in the backlog to fix the readiness and liveness probe in the API :)

@jon-funk jon-funk removed the not ready Not ready for review, WIP, do not merge. label Dec 5, 2024
.github/scripts/rollout_healthcheck.sh Outdated Show resolved Hide resolved
.github/scripts/rollout_healthcheck.sh Show resolved Hide resolved
.github/scripts/rollout_healthcheck.sh Outdated Show resolved Hide resolved
@jon-funk
Copy link
Collaborator Author

Addressed QA comments - do note the new addition of the force pass flag. While we're finding other issues in our infra I wanted to add this so we don't block test/prod with it given the topology of our pipeline.

@afwilcox afwilcox merged commit 8aaeb87 into release/0.6.9 Dec 10, 2024
17 checks passed
@afwilcox afwilcox deleted the CE-1176 branch December 10, 2024 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pipeline change Change that updates the pipeline
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants