Bob/7019 prod e2e health check #7145

fzhao99 · 2024-01-08T14:00:21Z

DEVOPS PULL REQUEST

Related Issue

Resolves Create prod E2E health check #7019 (hopefully without bringing down prod this time, as the earlier PR did)
From what I can tell, the issue was a reference to the new class that I added to the security filter chain, which expected an instance of an actuator endpoint that wasn't there. The reference was in fact redundant, since there was already a reference to the overall health endpoint.
- Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.
- Have verified that the app works fine in two separate lowers (dev4 and dev6) but would welcome additional ideas for validation given the history with this change.

Changes Proposed

- Adds a backend health actuator endpoint that returns
  - Down if a db table call or an Okta client health call errors
  - Up otherwise
- Adds a frontend page that pings that endpoint and displays up/down accordingly
- Adds a workflow that triggers on a prod deploy and sends a Slack alert if things error

Additional Information

Setting up the PagerDuty integration for this was a bit complicated from the ops / infra side, so we opted to use a Slack alert in the near term. A follow-up ticket to swap this out for an eventual PagerDuty alert is here

Testing

The alert is set up in this branch to be triggered post-prod deploy, but workflows can't be triggered by a deploy until the branch defining them gets into main. I've put up this branch with a push trigger and a hard-coded env var for dev6 to prove that the script invocation and failure state work.

There's not a great way for us to test the "prod deploy" trigger part until this branch gets in, so after merging, I'll make sure to keep an eye on the next prod deploy to make sure everything's working.
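As a rough illustration of the trigger-and-alert shape described above, here is a minimal sketch of such a workflow. It is not the workflow in this PR: the workflow name `Deploy Prod`, the `BASE_DOMAIN` variable, and the `SLACK_WEBHOOK_URL` secret are placeholders, and it assumes a `workflow_run` trigger. It also curls the actuator endpoint directly rather than going through the frontend page and script that this PR adds.

```yaml
# Hypothetical sketch; names, variables, and secrets are placeholders.
name: Smoke test post deploy
on:
  workflow_run:
    workflows: ["Deploy Prod"]   # assumed name of the prod deploy workflow
    types: [completed]
jobs:
  smoke-test:
    # Only run the check when the deploy itself succeeded.
    if: github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - name: Check health endpoint
        # curl --fail exits non-zero on HTTP >= 400; by default the
        # actuator health endpoint answers 503 when the status is DOWN.
        run: curl --fail --silent --show-error "https://${BASE_DOMAIN}/api/actuator/health/backend-and-db-smoke-test"
        env:
          BASE_DOMAIN: ${{ vars.BASE_DOMAIN }}
      - name: Alert Slack on failure
        if: failure()
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Post-deploy smoke test failed"}' \
            "$SLACK_WEBHOOK_URL"
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```

A `workflow_run` trigger only fires for workflow files that exist on the default branch, which would match the constraint described above that the deploy trigger can't be exercised until the branch is merged.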

DanielSass

lgtm

emyl3 · 2024-01-09T17:33:01Z

@fzhao99 A couple of follow-up questions to confirm I'm understanding things correctly 😅

> The reference was in fact redundant since there was already a reference to the overall health endpoint

Is that this line? And it looks like we get that by default?

> Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.

Please correct me if I am wrong, but I thought you deployed this on a dev environment before and it didn't blow up? 🤔 I thought the SecurityConfiguration file is used for lowers as well?

Thank you for looking into this!!!

mpbrown · 2024-01-09T19:44:51Z

backend/src/main/resources/application.yaml

@@ -78,6 +78,7 @@ management:
  endpoint.health.probes.enabled: true
  endpoint.info.enabled: true
  endpoints.web.exposure.include: health, info
+  endpoint.health.show-components: always


Do we need management.endpoint.health.show-details as well? It looks like the Spring docs mention that if we have security enabled and want to use `always`, then the security configuration must permit access to the health endpoint for both authenticated and unauthenticated users. If I'm understanding it correctly, was that what we were trying to do in the original PR here? Is this now resolved because we are overriding the normal health endpoint, which is already included in the security config?

fzhao99 · Jan 10, 2024

Yeah, so the docs were a bit confusing here. Spring apparently makes a distinction between the details of the health endpoint (show-details) and just the overall status (show-components).

I talked a little about the tradeoffs for show-details here. TL;DR: show-details isn't necessary for us to reach this particular actuator endpoint and get its status, and since it exposes more info about the app, I elected to leave it off. I think endpoint.health.show-components just exposes the bare minimum (whether that endpoint is healthy or not), so it works for our purposes.

You can see that the actuator is available even unauthed if you go to https://dev4.simplereport.gov/api/actuator/health/backend-and-db-smoke-test
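For reference, a minimal sketch of how the two settings sit side by side, using the same flat `management.*` keys as the `application.yaml` diff above. The commented-out `show-details` line is included only for contrast and is not part of this PR.

```yaml
management:
  endpoints.web.exposure.include: health, info
  # Lists each health component with its UP/DOWN status and makes the
  # per-component paths under /actuator/health/ reachable.
  endpoint.health.show-components: always
  # Would additionally expose each component's detail payload; left off here.
  # endpoint.health.show-details: always
```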

emyl3 · Jan 10, 2024

Thank you for breaking this down!! This was super clarifying!

sonarqubecloud · 2024-01-10T14:57:31Z

Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
82.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

fzhao99 · 2024-01-10T16:12:46Z

cc @emyl3

> > The reference was in fact redundant since there was already a reference to the overall health endpoint
>
> That's this line? and we get that by default it looks like?

Yep, we get the overall health endpoint for free. While trying to get the new endpoint to display, I tried adding it to the security filter chain, thinking I needed that in order to hit the actuator endpoint unauthed. The actual fix was to set the show-components flag to always, as I discovered here, but I left the filter chain change in, not thinking it mattered. It actually did matter: it was the code that brought things down 😅

I deployed a branch without (on dev4) and with (on dev6) those two extra lines to verify they were indeed the problem. You can see that dev6 replicates the issue that showed up on prod (relevant Azure logs here).

> > Locally, the app booted fine because of the special security permissions, but once things got deployed, things blew up.
>
> Please correct me if I am wrong, but I thought you deployed this on a dev environment before and it didn't blow up? 🤔 I thought the SecurityConfiguration file is used for lowers as well?

So Merethe and I did a bit of digging on this. I think I must have deployed a version of the app from before the security configuration changes and run the smoke tests on that version. Then, because the subsequent deploys didn't break (i.e. they were green, with the CI checks passing as well), I must have neglected to click through and verify that things were actually working. If you filter by the past 30 days on dev6, you can see a bunch of related errors starting when I last deployed the branch with the security filter config to dev6 (Dec 15) and lasting until you deployed a copy of main on the 21st.

mpbrown

LGTM! Thanks for investigating this and providing a thorough explanation of everything!

emyl3

LGTM!!! 🚀

fzhao99 added 24 commits January 8, 2024 08:59

- 2655882: frontend component and script
- 26f7891: backend config
- f749843: actions
- 1b5612f: rename
- b984e00: some other stuff
- 9e25175: add slack alert back in
- f1d0d6e: remove slack comment
- 20f99da: move slack alert over
- e6757bb: dan feedback
- edaf50e: add okta call and update script config
- c640d3e: lint
- 4ee1215: remove trailing slash
- 87a6249: remove empty var
- 6db0690: remove comment
- d30c568: move url to one place
- 03236e9: use existing status check instead
- 9463564: string format and equality
- 286ccc5: move literal to left
- bc22644: lol it's friday alright
- aa87bd7: add comment to document workflow
- f74a96f: better comment
- 503b3fd: use base domain env var instead
- 663933a: set env var
- 215afe5: don't hard code node version

fzhao99 had a problem deploying to dev6 January 8, 2024 14:03 — with GitHub Actions Error

fzhao99 temporarily deployed to dev4 January 8, 2024 14:03 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev4 January 8, 2024 14:11 — with GitHub Actions Inactive

add endpoint annotation

be2209f

fzhao99 temporarily deployed to dev4 January 8, 2024 14:20 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev4 January 8, 2024 14:31 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev6 January 8, 2024 16:03 — with GitHub Actions Inactive

add a third argument catch

e09d362

fzhao99 temporarily deployed to dev4 January 8, 2024 16:13 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev6 January 8, 2024 16:13 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev4 January 8, 2024 16:19 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev6 January 8, 2024 16:19 — with GitHub Actions Inactive

fzhao99 marked this pull request as ready for review January 8, 2024 16:52

fzhao99 requested review from mehansen, DanielSass, emyl3, mpbrown and alismx January 8, 2024 16:58

DanielSass previously approved these changes Jan 9, 2024

mpbrown reviewed Jan 9, 2024

code smell

64960c1

fzhao99 dismissed DanielSass’s stale review via 64960c1 January 10, 2024 14:44

fzhao99 temporarily deployed to dev4 January 10, 2024 16:00 — with GitHub Actions Inactive

fzhao99 temporarily deployed to dev4 January 10, 2024 16:09 — with GitHub Actions Inactive

fzhao99 requested review from mpbrown and DanielSass January 10, 2024 16:22

mpbrown approved these changes Jan 10, 2024

emyl3 approved these changes Jan 10, 2024

fzhao99 added this pull request to the merge queue Jan 10, 2024

Merged via the queue into main with commit 50a017c Jan 10, 2024
69 checks passed

fzhao99 deleted the bob/7019-prod-e2e-health-check branch January 10, 2024 20:45

fzhao99 mentioned this pull request Jan 12, 2024

Test that the prod deploy alert will alert if the backend and frontend can't talk to each other #7160

Closed
