Health probe/bad state detection #57

Open
timcosta opened this issue Dec 1, 2022 · 8 comments

timcosta (Contributor) commented Dec 1, 2022

Hey all! Wondering if y'all have thought about adding a health probe endpoint (or status checker of some sort) to this project that asserts whether healthy connections have been opened to the sources it's supposed to be listening to.

I'm asking because I'm running into a situation where the container is essentially just looping and printing a message like this:

%3|1669868564.698|FAIL|rdkafka#consumer-1| [thrd:sasl_ssl://<redactedbroker>]: sasl_ssl://<redactedbroker>.amazonaws.com:9096/bootstrap: SASL authentication error: Authentication failed during authentication due to invalid credentials with SASL mechanism SCRAM-SHA-512 (after 301ms in state AUTH_REQ, 4 identical error(s) suppressed)

I'm hoping to find a way to have the container exit in situations like this, whether via an external health probe against an HTTP endpoint or some internal status check that exits or throws an exception. Any thoughts? (Rough sketch of the internal option below.)
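For the internal option, I'm imagining something roughly like this. This is purely illustrative and assumes the consumer is confluent-kafka under the hood; none of these names exist in the project:

```python
# Purely illustrative: exit the process after sustained SASL/auth failures,
# assuming the underlying consumer is confluent-kafka.
import sys
import time

from confluent_kafka import Consumer, KafkaError

_auth_failures = []


def on_kafka_error(err: KafkaError) -> None:
    # librdkafka surfaces SASL problems as _AUTHENTICATION errors
    if err.code() == KafkaError._AUTHENTICATION:
        _auth_failures.append(time.monotonic())
        # give up if auth has been failing continuously for ~5 minutes
        if _auth_failures[-1] - _auth_failures[0] > 300:
            sys.stderr.write(f"exiting after repeated auth errors: {err}\n")
            sys.exit(1)


consumer = Consumer({
    "bootstrap.servers": "<redacted-broker>:9096",
    "group.id": "datahub-actions",
    "error_cb": on_kafka_error,
})
```

With something like that in place, the container dies instead of looping, and the orchestrator can restart it or surface the crash.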

hsheth2 (Contributor) commented Dec 1, 2022

@szalai1 would any of your Prometheus work be relevant here?

I'm thinking we could reuse the /metrics endpoint as a healthcheck.

szalai1 (Contributor) commented Dec 2, 2022

The /metrics endpoint will return status code 200, and we can add any values there in the form of metrics. The output of that endpoint is one <metricname> <value> pair per line, e.g. connection_ok 1, but it would need to be parsed if we want to use it for health checking, which is complicated. So having a separate /health endpoint is probably easier if we want the pod to report health to Kubernetes.

Options I see to kill the pod in such a case:

  • as @timcosta said, the process stops itself -> k8s restarts the pod -> CrashLoopBackOff
  • we add a readiness probe: a check that runs when the pod comes up and verifies whether it could connect to Kafka; if not -> the pod is not ready.

tl;dr: I think we need a /ready endpoint that returns 200 once connected to Kafka, plus a readiness probe set in the k8s deployment (see the sketch below).
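A minimal sketch of what that /ready endpoint could look like, stdlib only. The kafka_connected flag is a placeholder that the consumer loop would set once it has joined the cluster; nothing here is existing code:

```python
# Minimal sketch of a /ready endpoint; `kafka_connected` is a placeholder flag
# that the consumer loop would set after it successfully connects to Kafka.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

kafka_connected = threading.Event()


class ReadyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ready" and kafka_connected.is_set():
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            # not connected yet (or unknown path) -> not ready
            self.send_response(503)
            self.end_headers()


def serve_ready_endpoint(port: int = 8000) -> None:
    server = HTTPServer(("", port), ReadyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
```

The k8s deployment would then just point an httpGet readiness probe at that port and path.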

timcosta (Contributor, Author) commented Dec 2, 2022

I would agree with the /ready endpoint. Metrics endpoints are sometimes "heavy" and as a result not great targets for probes; that's my primary concern with reusing a metrics endpoint.

shirshanka (Collaborator) commented

Do we need readiness probes not just for connectivity to Kafka, but also for each individual action?

e.g. if connectivity to DataHub Kafka looks good, but Slack action is unhealthy because the credentials don't work, what would the expected behavior be?

timcosta (Contributor, Author) commented Dec 5, 2022

That's a good question... a couple of brief thoughts:

  • If all actions (ingestion + custom) come in / are orchestrated by Kafka, Kafka is essentially your "database".
    • Without Kafka, nothing will ever be done, as far as I understand.
  • Actions are like web requests: just because one fails doesn't mean all of them will.
    • It's entirely possible, for example, that Slack creds go bad but schema ingestion can still continue successfully.

If I think about it that way, I would expect the liveness/readiness endpoint to assert on the Kafka (database connection) state, and the metrics to expose per-action failures, like per-path failures on a webserver (rough sketch below).
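To make that concrete, something like the following is what I'm picturing, using prometheus_client. All metric and function names here are made up for illustration:

```python
# Illustrative: readiness tracks only the Kafka "database" connection, while
# per-action failures are surfaced as metrics, like per-path errors on a web server.
from prometheus_client import Counter, Gauge, start_http_server

ACTION_FAILURES = Counter(
    "datahub_action_failures_total",
    "Number of failed action executions",
    ["action"],
)
KAFKA_CONNECTED = Gauge(
    "datahub_kafka_connected",
    "1 once the consumer has connected to Kafka, 0 otherwise",
)


def record_action_result(action_name: str, succeeded: bool) -> None:
    if not succeeded:
        # e.g. bad Slack credentials only bump the slack counter; Kafka
        # readiness (and therefore the pod) is unaffected.
        ACTION_FAILURES.labels(action=action_name).inc()


start_http_server(9090)  # exposes /metrics for scraping
```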

github-actions bot commented Jan 4, 2023

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions bot added the stale label Jan 4, 2023
timcosta (Contributor, Author) commented Jan 4, 2023

Not stale

hsheth2 added accepted and removed stale labels Jan 4, 2023
timcosta (Contributor, Author) commented

Hey all -

Bumping this because we just had an actions container enter an unhealthy state where it appeared to run out of file descriptors (SSLError(OSError(24, 'Too many open files'))), but the process didn't exit and we had no way to automatically detect that until we noticed that data ingestion was failing. (A rough sketch of the kind of check that might have caught this is below.)
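For what it's worth, the kind of internal check that might have caught this for us is a simple staleness watchdog; a hypothetical sketch, nothing here exists in the project today:

```python
# Hypothetical watchdog: if the consumer loop hasn't made progress for too
# long, assume the process is wedged and exit so the orchestrator restarts it.
import os
import threading
import time

last_progress = time.monotonic()


def mark_progress() -> None:
    # call this from the consumer loop after each successful poll/commit
    global last_progress
    last_progress = time.monotonic()


def watchdog(max_idle_seconds: float = 600.0) -> None:
    while True:
        time.sleep(30)
        if time.monotonic() - last_progress > max_idle_seconds:
            # hard-exit; a cleaner shutdown could be attempted first
            os._exit(1)


threading.Thread(target=watchdog, daemon=True).start()
```

That would cover "stuck but alive" states like the file-descriptor exhaustion above, not just failed Kafka connections.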
