chore: pipeline health checks during rollout #795
Conversation
@@ -22,6 +22,7 @@ spec:
       labels:
         {{- include "backend.labels" . | nindent 8 }}
     spec:
+      minReadySeconds: 10
Do we need to add an extra 10-second delay before the pod is ready to serve incoming traffic?
We've had some API errors from fresh pods starting up in test while under load. I have another PR that lightens up the probes and the scaling rates from the HPA - but init container timing is tricky with probes in general. My thinking here was just a bit of extra safety margin, although now that you mention it, 10s may be too much. Thoughts?
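For context, the "scaling rates from the HPA" part can be tuned with autoscaling/v2 behavior policies. A minimal sketch, assuming a hypothetical `backend` Deployment and illustrative numbers, not the values from that other PR:

```yaml
# Sketch only: slow scale-up so freshly scheduled pods aren't hit with
# a burst of traffic the moment they pass their probes.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend             # hypothetical target
  minReplicas: 2
  maxReplicas: 6
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2            # add at most 2 pods...
          periodSeconds: 60   # ...per minute
```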
I would recommend we fix the readiness probe and, if required, add startup probes. Waiting an extra 10 seconds once a pod is already ready doesn't seem ideal to me.
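For illustration, a minimal sketch of that approach, assuming a hypothetical `/health` endpoint, port, and container name (the chart's actual values may differ): a startupProbe absorbs slow init/boot timing, so the readinessProbe can stay strict without an extra `minReadySeconds` delay.

```yaml
containers:
  - name: backend              # hypothetical container name
    startupProbe:              # covers slow startup; other probes run only after it passes
      httpGet:
        path: /health          # hypothetical endpoint
        port: 8080
      periodSeconds: 5
      failureThreshold: 30     # allows up to 30 * 5s = 150s for startup
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      periodSeconds: 10
      failureThreshold: 3      # pod pulled from the Service after ~30s of failures
```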
I should mention that at the application level, our health check on the API backend is 'dumb': it just responds to the HTTP request and doesn't actually verify that the DB is connected and healthy. The k8s probe's source of truth is therefore unreliable, so I'm compensating for that at the infra level. For now.
removed these - appreciate the feedback @mishraomp
Thanks Jon, I would vote for another ticket in the backlog to fix the readiness and liveness probes in the API :)
Addressed QA comments - do note the new addition of the force-pass flag. While we're finding other issues in our infra, I wanted to add this so we don't block test/prod with it, given the topology of our pipeline.
Quality Gate passed
Description
This new script + action runs in tandem with the helm build to QA it at the infra level, highlighting issues that helm may not report back on. Additionally, the script is configured to time out just before helm does and to collect information in the event of a failing or timed-out helm build, so you get a report on what might be wrong right in the PR.
Example of it catching a probe failure in events: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12170089403/job/33944486709?pr=795
Example of an all green success: https://github.com/bcgov/nr-compliance-enforcement/actions/runs/12171219493/job/33947826812
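As a rough sketch of the shape of this check (the real logic lives in the script added by this PR; the step name, chart path, timeouts, label selector, and resource names below are placeholders, not the actual implementation):

```yaml
# Sketch only: watch the rollout with a timeout slightly shorter than
# helm's, so diagnostics are collected before the helm step gives up.
- name: Deploy and watch rollout
  run: |
    helm upgrade --install backend ./charts/backend --wait --timeout 10m &
    HELM_PID=$!

    # Give up roughly a minute before helm does, then dump context.
    if ! kubectl rollout status deployment/backend --timeout=9m; then
      echo "::group::Rollout diagnostics"
      kubectl get events --sort-by=.lastTimestamp
      kubectl describe pods -l app.kubernetes.io/name=backend
      echo "::endgroup::"
    fi
    wait "$HELM_PID"
```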
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
Checklist
Further comments
Thanks for the PR!
Deployments, as required, will be available below:
Please create PRs in draft mode. Mark as ready to enable:
After merge, new images are deployed in: