[RFC 630] add readiness healthchecks for apps #630

Merged
merged 12 commits into from
Aug 22, 2023

ameowlia
Member

@ameowlia ameowlia requested review from a team, rkoster, beyhan, stephanme, AP-Hunt and emalm and removed request for a team June 26, 2023 14:27
@ameowlia ameowlia changed the title add RFC for readiness healthchecks add readiness healthchecks for apps Jun 26, 2023
Member

@beyhan beyhan left a comment

In general I really like the proposal. I added some minor comments and questions I have.

@maxmoehl
Member

Currently diego replaces the envoy certificate when the application instance is going down. As a result, gorouter fails to connect to envoy because of the wrong certificate during the window in which the app is known to be unavailable but the de-registration has not yet been propagated to gorouter.

The same mechanism could be used for readiness checks to prevent requests from being sent to the application in the small timeframe between the readiness check failing and the de-registration being propagated to gorouter.

@peanball
Contributor

peanball commented Jul 21, 2023

We just had a discussion about a customer scenario that might be interesting for this RFC.

The readiness probe is polled on a schedule, like the health check (e.g. every n seconds). When the readiness state changes, the cascade of route updates, etc. takes place.

What we would like to add is an immediate control from the app to the readiness state checker, to be able to say: "My app is not ready, right now!".

The reverse direction (the app signalling that it is ready again) would of course also work and could save some time, but from our point of view the case where an app becomes not ready is much more important.

It would allow app developers to protect their app from an overload scenario.

Consequences or requests for the "immediate" part of the state change:

  • New requests should stop being accepted right away. This means that possibly the TLS termination proxy (i.e. Envoy) could be adjusted to serve an invalid TLS cert, as it does for stopping instances.
  • The cascade of messages to Gorouter should still be done. "Disabling" the TLS proxy is only to increase the responsiveness of the "not ready" state.

What do you think about this, and do you see that this might still fit in the frame of this RFC?

@ChrisMcGowan ChrisMcGowan self-requested a review July 25, 2023 14:31
@mariash
Member

mariash commented Jul 25, 2023

@beyhan Can this be resolved by lowering the poll interval?

@beyhan
Member

beyhan commented Jul 26, 2023

> @beyhan Can this be resolved by lowering the poll interval?

@mariash at the moment the interval is 30s, which is also documented here. What options do we have to lower the interval? It could help but won't be perfect: depending on how long the poll interval is, there is still a gap in which the app instance could receive requests.
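
To make the gap concrete, here is a rough sketch of the worst-case window in which a not-ready instance can still receive requests. It assumes the window is simply the poll interval plus the time the route removal takes to propagate. The function name and the 5s propagation figure are illustrative assumptions, not values from the RFC.

```python
def worst_case_exposure_seconds(poll_interval: float, propagation_delay: float) -> float:
    """Upper bound on how long a not-ready instance may still receive traffic.

    Assumes the instance becomes not ready just after a successful check, so a
    full interval passes before the next check notices, and the route removal
    then still has to propagate. Both inputs are illustrative.
    """
    if poll_interval < 0 or propagation_delay < 0:
        raise ValueError("durations must be non-negative")
    return poll_interval + propagation_delay


# With the 30s interval mentioned above and a hypothetical 5s propagation delay:
print(worst_case_exposure_seconds(30, 5))  # 35
```

Lowering the interval shrinks the first term only; the propagation delay remains, which is why a shorter interval helps but does not close the gap.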

@ameowlia for completeness, could you please add the concerns you expressed during the TOC meeting yesterday regarding the suggestion to replace the certificates?

@ameowlia
Member Author

ameowlia commented Aug 9, 2023

@peanball, I really like the idea of the immediate state change, however I have some concerns with doing it via the envoy cert because of c2c implications.

Failure scenario

Here is the scenario I am imagining:

  1. App has a /ready endpoint and a readiness health check pointing to it.
  2. End users are happily accessing the app instance (AI) via gorouter and envoy.
  3. Other apps are happily accessing the AI via internal routes and c2c with tls via envoy.
  4. Oh no! /ready returns 500! App instance is marked not ready.
  5. Immediately the cert is removed so no traffic can succeed through envoy.
  6. The end user traffic through gorouter fails with a retriable error, and the end user gets routed to a different app that is ready. The end user doesn't notice anything. Nice!
  7. The other apps trying to access the app via c2c with tls get failure messages for 20 seconds until the route is removed from service discovery controller. 🔥
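
For illustration, steps 1 and 4 could look like the following on the app side. This is a minimal sketch using only the Python standard library. The `/ready` path comes from the scenario above, while `ReadinessState`, `make_handler` and `start_server` are hypothetical names, and how the platform consumes the endpoint is not shown.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ReadinessState:
    """Thread-safe flag the app flips when it can or cannot take traffic."""

    def __init__(self) -> None:
        self._ready = threading.Event()
        self._ready.set()

    def set_ready(self, ready: bool) -> None:
        if ready:
            self._ready.set()
        else:
            self._ready.clear()

    def is_ready(self) -> bool:
        return self._ready.is_set()


def make_handler(state: ReadinessState):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/ready":
                self.send_error(404)
                return
            # 200 while ready; 500 mirrors step 4 of the scenario.
            status = 200 if state.is_ready() else 500
            body = b"ready\n" if status == 200 else b"not ready\n"
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            pass  # keep the sketch quiet

    return Handler


def start_server(state: ReadinessState, port: int = 0) -> HTTPServer:
    """Start the endpoint on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), make_handler(state))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `state.set_ready(False)` makes the next poll of `/ready` return 500. The instance is only marked not ready once that poll happens.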

Technical Details

Internal routes are resolved via bosh dns + service discovery controller. They are not proxied through anything like gorouter that can add automatic retries. Any retry logic would have to be added to the app.
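
As a sketch of what app-side retry logic might look like, the helper below retries a call on connection-level failures. `call_with_retries` and its parameters are hypothetical and not part of any Cloud Foundry library. The caller supplies the actual request as a function.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on connection-level errors.

    Only OSError is retried. That covers ConnectionError and TLS handshake
    failures raised as ssl.SSLError; other exceptions propagate at once.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Whether a retry reaches a different instance depends on how the internal route resolves, which this sketch does not control.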

Future regarding this RFC

I suggest this RFC continue on without the requirement for an immediate state change. I think we should release what we have currently proposed as a v1 and get feedback on it. During that time we should continue to think about how to get this immediate state change without negative consequences for c2c with tls.

@ameowlia
Member Author

ameowlia commented Aug 9, 2023

@mariash - can you add a section to the RFC about exposing the interval? It won't be a perfect fix for this issue, but it could improve it.

edited to add: ✅ done

@peanball
Contributor

peanball commented Aug 9, 2023

No objections to continuing without the immediate state change for this RFC and considering it in a larger context.

We are currently not using c2c networking because of the missing load balancing across instances and other bits (e.g. retries) that we get via gorouter.

* add interval property
* add process type for backwards compatibility
@ameowlia ameowlia changed the title add readiness healthchecks for apps [RFC 630] add readiness healthchecks for apps Aug 14, 2023
@ameowlia
Member Author

Starting Final Comment Period. FCP should end on Aug 22, 2023.

@ChrisMcGowan ChrisMcGowan merged commit 923e897 into main Aug 22, 2023
@rkoster rkoster deleted the rfc-readiness-healthchecks branch August 22, 2023 14:42
ameowlia added a commit that referenced this pull request Aug 25, 2023
@joergdw
Contributor

joergdw commented Sep 29, 2023

The link on the top of this PR does not work for me. https://github.com/cloudfoundry/community/blob/main/toc/rfc/rfc-0020-readiness-healthchecks.md however does.

Labels
rfc CFF community RFC toc
Projects
Status: Done