Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bob/7019 prod e2e health check #7145

Merged
merged 28 commits into from
Jan 10, 2024
Merged

Bob/7019 prod e2e health check #7145

merged 28 commits into from
Jan 10, 2024

Conversation

fzhao99
Copy link
Contributor

@fzhao99 fzhao99 commented Jan 8, 2024

DEVOPS PULL REQUEST

Related Issue

  • Resolves Create prod E2E health check #7019 (hopefully without bringing down prod this time as this PR did)
  • From what I can tell, the issue was the addition of a reference to the new class in the security filter chain that was expecting an instance of an actuator endpoint that wasn't there. The reference was in fact redundant since there was already a reference to the overall health endpoint.
    • Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.
    • Have verified that the app works fine in two separate lowers (dev4 and dev6) but would welcome additional ideas for validation given the history with this change.

Changes Proposed

  • Adds a backend health actuator endpoint that returns
    • Down if a db table call or an Okta client health call errors
    • Up otherwise
  • Adds a frontend page that pings that endpoint and displays up/down accordingly
  • A workflow that triggers on a prod deploy that sends a Slack alert if things error

Additional Information

Setting up the Pagerduty integration for this was a bit complicated from the ops / infra side, so we opted to use a Slack alert instead in the near term. Followup ticket to swap this out for an eventual Pagerduty alert is here

Testing

The alert is set up in this branch to be triggered post-prod deploy, but workflows can't be triggered by a deploy until the branch defining them gets into main. I've put up this branch
with a push trigger / hard coded env var for dev6 to prove that the script invocation / failure state works for a hard-coded environment variable.

There's not a great way for us to test the "prod deploy" trigger part until this branch gets in, so after merging, I'll make sure to keep an eye on the next prod deploy to make sure everything's working.


DanielSass
DanielSass previously approved these changes Jan 9, 2024
Copy link
Collaborator

@DanielSass DanielSass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@emyl3
Copy link
Collaborator

emyl3 commented Jan 9, 2024

@fzhao99 Couple of follow-up to confirm I'm gaining a better understanding of things 😅

The reference was in fact redundant since there was already a reference to the overall health endpoint

That's this line? and we get that by default it looks like?

Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.

Please correct me if I am wrong, but I thought you deployed this on a dev environment before and it didn't blow up? 🤔 I thought the SecurityConfiguration file is used for lowers as well?

Thank you for looking into this!!!

@@ -78,6 +78,7 @@ management:
endpoint.health.probes.enabled: true
endpoint.info.enabled: true
endpoints.web.exposure.include: health, info
endpoint.health.show-components: always
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need management.endpoint.health.show-details as well? It looks like the spring docs mention that if we have security enabled and want to use always then the security configuration must permit access to the health endpoint for both authenticated and unauthenticated users. If I'm understanding it correctly, was that what we were trying to do in the original PR here? Is this now resolved because we are overriding the normal health endpoint which is already included in the security config?

Copy link
Contributor Author

@fzhao99 fzhao99 Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so the docs were a bit confusing here. Spring apparently makes a distinction between the details of the health endpoint (show-details) and just the overall status (show-components)

I talked a little about the tradeoffs for show-details here. TL;DR, it isn't necessary for us to reach and get the status of this particular actuator endpoint and since it exposes more info about the app, I elected to leave it off. I think endpoint.health.show-components just exposes the bare minimum (whether that endpoint is healthy or not), so it works for our purposes.

You can see that the actuator is available even unauthed if you go to https://dev4.simplereport.gov/api/actuator/health/backend-and-db-smoke-test

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for breaking this down!! This was super clarifying!

Copy link

@fzhao99
Copy link
Contributor Author

fzhao99 commented Jan 10, 2024

cc @emyl3

The reference was in fact redundant since there was already a reference to the overall health endpoint

That's this line? and we get that by default it looks like?

Yep, we get the overall health endpoint for free, but in the process of trying to get the endpoint to display, I tried adding it to the filter security chain (thinking that I needed to in order to hit that actuator endpoint unauthed.) The actual fix needed was to flip the show-components flag to true as I discovered here but I left this in not thinking it mattered. It actually did matter / was the code that brought things down 😅

I deployed a branch without (on dev4) and without (on dev6) those two extra lines to verify they were indeed the problem. You can see that dev6 replicates the issue (relevant Azure logs here) that showed up on prod

Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.

Please correct me if I am wrong, but I thought you deployed this on a dev environment before and it didn't blow up? 🤔 I thought the SecurityConfiguration file is used for lowers as well?

So Merethe and I did a bit of digging on this, but I think I must have deployed a version of the app before adding in the security configuration changes and did smoke tests on that version. Then, because the subsequent deploys didn't break (ie were green with the CI checks working as well) I must have neglected to properly click through and verify things were working. If you filter by the past 30 days on dev6, you can see that there were a bunch of related errors after I deployed the version of the branch with the security filter config in it corresponding to when I last deployed to dev6 (dec 15) until you deployed a copy of main on the 21st.

@fzhao99 fzhao99 requested review from mpbrown and DanielSass January 10, 2024 16:22
Copy link
Collaborator

@mpbrown mpbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for investigating this and providing a thorough explanation of everything!

Copy link
Collaborator

@emyl3 emyl3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!! 🚀

@fzhao99 fzhao99 added this pull request to the merge queue Jan 10, 2024
Merged via the queue into main with commit 50a017c Jan 10, 2024
69 checks passed
@fzhao99 fzhao99 deleted the bob/7019-prod-e2e-health-check branch January 10, 2024 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create prod E2E health check
4 participants