-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bob/7019 prod e2e health check #7145
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@fzhao99 Couple of follow-up to confirm I'm gaining a better understanding of things 😅
That's this line? and we get that by default it looks like?
Please correct me if I am wrong, but I thought you deployed this on a dev environment before and it didn't blow up? 🤔 I thought the Thank you for looking into this!!! |
@@ -78,6 +78,7 @@ management: | |||
endpoint.health.probes.enabled: true | |||
endpoint.info.enabled: true | |||
endpoints.web.exposure.include: health, info | |||
endpoint.health.show-components: always |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need management.endpoint.health.show-details
as well? It looks like the spring docs mention that if we have security enabled and want to use always
then the security configuration must permit access to the health endpoint for both authenticated and unauthenticated users. If I'm understanding it correctly, was that what we were trying to do in the original PR here? Is this now resolved because we are overriding the normal health endpoint which is already included in the security config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah so the docs were a bit confusing here. Spring apparently makes a distinction between the details of the health endpoint (show-details
) and just the overall status (show-components
)
I talked a little about the tradeoffs for show-details
here. TL;DR, it isn't necessary for us to reach and get the status of this particular actuator endpoint and since it exposes more info about the app, I elected to leave it off. I think endpoint.health.show-components
just exposes the bare minimum (whether that endpoint is healthy or not), so it works for our purposes.
You can see that the actuator is available even unauthed if you go to https://dev4.simplereport.gov/api/actuator/health/backend-and-db-smoke-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for breaking this down!! This was super clarifying!
Quality Gate passedKudos, no new issues were introduced! 0 New issues |
cc @emyl3
Yep, we get the overall health endpoint for free, but in the process of trying to get the endpoint to display, I tried adding it to the filter security chain (thinking that I needed to in order to hit that actuator endpoint unauthed.) The actual fix needed was to flip the I deployed a branch without (on dev4) and without (on dev6) those two extra lines to verify they were indeed the problem. You can see that dev6 replicates the issue (relevant Azure logs here) that showed up on prod
So Merethe and I did a bit of digging on this, but I think I must have deployed a version of the app before adding in the security configuration changes and did smoke tests on that version. Then, because the subsequent deploys didn't break (ie were green with the CI checks working as well) I must have neglected to properly click through and verify things were working. If you filter by the past 30 days on dev6, you can see that there were a bunch of related errors after I deployed the version of the branch with the security filter config in it corresponding to when I last deployed to dev6 (dec 15) until you deployed a copy of main on the 21st. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for investigating this and providing a thorough explanation of everything!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!!! 🚀
DEVOPS PULL REQUEST
Related Issue
Changes Proposed
Additional Information
Setting up the Pagerduty integration for this was a bit complicated from the ops / infra side, so we opted to use a Slack alert instead in the near term. Followup ticket to swap this out for an eventual Pagerduty alert is here
Testing
The alert is set up in this branch to be triggered post-prod deploy, but workflows can't be triggered by a deploy until the branch defining them gets into main. I've put up this branch
with a push trigger / hard coded env var for dev6 to prove that the script invocation / failure state works for a hard-coded environment variable.
There's not a great way for us to test the "prod deploy" trigger part until this branch gets in, so after merging, I'll make sure to keep an eye on the next prod deploy to make sure everything's working.