Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health check API endpoints #595

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

oneonestar
Copy link
Member

Description

Add two new health check API endpoints: /trino-gateway/livez and /trino-gateway/readyz.

Additional context and related issues

/trino-gateway/livez always returns 200.
/trino-gateway/readyz returns 200 after the initial fetch from database and the first round of Trino cluster health check is completed. Otherwise, 503 will be returned.

Release notes

(x) Release notes are required, with the following suggested text:

* Added API health endpoints `/trino-gateway/livez` and `/trino-gateway/readyz` for monitoring liveness and readiness. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jan 21, 2025
@oneonestar oneonestar force-pushed the star/health_check_endpoint branch from ede6246 to 819c956 Compare January 21, 2025 01:43
@oneonestar oneonestar marked this pull request as draft January 21, 2025 02:30
@oneonestar
Copy link
Member Author

Might need to bind ActiveClusterMonitor by default instead of injecting it in config.

@mosabua
Copy link
Member

mosabua commented Jan 21, 2025

Have not really looked at the PR but I think we should agree on same names for those probes in Trino Gateway and Trino . What do you think are suitable URLs .. I think it could literally just be ready and live, or liveness and readiness .. or is there some standard that anyone is aware of?

@nineinchnick @martint @dain @electrum ?

Copy link
Member

@nineinchnick nineinchnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK Trino doesn't yet have endpoints dedicated for probes. The z suffix is a convention started at Google, widely adopted in Kubernetes apps. I'm ok with that, OTOH there's little chance we'll have conflicts with other endpoints.

@mosabua
Copy link
Member

mosabua commented Jan 21, 2025

AFAIK Trino doesn't yet have endpoints dedicated for probes. The z suffix is a convention started at Google, widely adopted in Kubernetes apps. I'm ok with that, OTOH there's little chance we'll have conflicts with other endpoints.

Fair enough .. i saw the healthz in others but the two you created @oneonestar are better .. so lets run with that then.

@oneonestar oneonestar force-pushed the star/health_check_endpoint branch from 819c956 to d20f52e Compare January 24, 2025 04:33
@oneonestar oneonestar marked this pull request as ready for review January 24, 2025 08:28
@oneonestar oneonestar force-pushed the star/health_check_endpoint branch from 72becaa to 034b5f6 Compare January 25, 2025 00:23
docs/installation.md Outdated Show resolved Hide resolved
docs/operation.md Outdated Show resolved Hide resolved
docs/operation.md Outdated Show resolved Hide resolved
## API health endpoints

Trino Gateway provides two API endpoints to indicate the current status of the server.
`/trino-gateway/livez` returns status code 200, indicating the server is alive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`/trino-gateway/livez` returns status code 200, indicating the server is alive.
* `/trino-gateway/livez` returns status code 200, indicating the server is alive.

we should detail what "alive" means specifically for Trino Gateway ..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also .. can it return something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint always sends back a 200 status code. However, it might not respond if the gateway is too busy, stuck, or taking a long time for garbage collection. This is what "alive" means.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is k8s basics but I think we should let our users know anyway .. not all of them will have k8s background or even run it on k8s


Trino Gateway provides two API endpoints to indicate the current status of the server.
`/trino-gateway/livez` returns status code 200, indicating the server is alive.
`/trino-gateway/readyz` returns status code 200, indicating the server has
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`/trino-gateway/readyz` returns status code 200, indicating the server has
* `/trino-gateway/readyz` returns status code 200, indicating the server has

also need to detail what that means .. such as .. the backend database connection is working, the connection to the external routing server (if any) is alive, and so on

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In current implementation readyz means the initial connection to backend database and the first round of health check on Trino clusters were completed. I think this might expose too much detail to the user.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No .. I think our users are very technical and we should not hide that information from in the docs .. its not in the UI or so .. but it needs to be in the docs

docs/operation.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants