healthcheck not failing even when service registration is not successful #8783
Comments
@micbar I would like to escalate this, for more details see https://github.com/owncloud/enterprise/issues/6547. The NATS client seems to be stuck in a disconnected state.
Reminds me of #7056
I am wondering what our nats clients, using the default options, do after 60 reconnect attempts have been reached:

```go
// Default Constants
const (
	Version                   = "1.33.1"
	DefaultURL                = "nats://127.0.0.1:4222"
	DefaultPort               = 4222
	DefaultMaxReconnect       = 60
	DefaultReconnectWait      = 2 * time.Second
	DefaultReconnectJitter    = 100 * time.Millisecond
	DefaultReconnectJitterTLS = time.Second
	DefaultTimeout            = 2 * time.Second
	DefaultPingInterval       = 2 * time.Minute
	DefaultMaxPingOut         = 2
	DefaultMaxChanLen         = 64 * 1024       // 64k
	DefaultReconnectBufSize   = 8 * 1024 * 1024 // 8MB
	RequestChanLen            = 8
	DefaultDrainTimeout       = 30 * time.Second
	DefaultFlusherTimeout     = time.Minute
	LangString                = "go"
)
```

hm ... that would explain why the nats js registry logs the "registration error for external service ..." with the "nats: connection closed" 🤔
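For reference, a minimal sketch (not the current micro/oCIS configuration) of how a nats.go client can be told to keep reconnecting instead of giving up after the default 60 attempts; the URL and log messages are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// MaxReconnects(-1) keeps the client reconnecting forever instead of
	// giving up after the default 60 attempts.
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		nats.MaxReconnects(-1),
		nats.ReconnectWait(2*time.Second),
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("nats disconnected: %v", err)
		}),
		nats.ReconnectHandler(func(nc *nats.Conn) {
			log.Printf("nats reconnected to %s", nc.ConnectedUrl())
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			// With a finite MaxReconnects this fires once the reconnect
			// budget is exhausted and the connection is permanently closed.
			log.Println("nats connection closed, no further reconnect attempts")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
}
```

With a finite MaxReconnects the ClosedHandler is the last callback to fire, which would line up with the "nats: connection closed" errors seen in the registry logs.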
The registry should kill the service if the underlying store cannot recover the underlying connection.
At least the service should not be healthy if the NATS client connection was closed.
The reproducer is:
If you look at the outgoing connections in the oCIS pods, you don't see any attempted or active connection to NATS, see #8783 (comment)
Decided to fix this like the following:
This should cause ocis to automatically reconnect on the next operation when the connection is closed.
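As a rough illustration of that "reconnect on the next operation" idea (names and wiring are hypothetical, not the actual micro plugin code):

```go
package registry

import (
	"sync"

	"github.com/nats-io/nats.go"
)

// lazyConn sketches the "reconnect on the next operation" idea: before every
// registry call the wrapper checks whether the underlying NATS connection was
// closed and dials a fresh one on demand. Names are illustrative only.
type lazyConn struct {
	mu   sync.Mutex
	url  string
	conn *nats.Conn
}

func (l *lazyConn) get() (*nats.Conn, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.conn != nil && !l.conn.IsClosed() {
		return l.conn, nil
	}
	// Connection is gone (or never existed): redial lazily.
	nc, err := nats.Connect(l.url)
	if err != nil {
		return nil, err
	}
	l.conn = nc
	return nc, nil
}
```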
Here is the micro PR: micro/plugins#139. Just a draft for now as I need to test it properly.
@butonic reading the docs you posted: Should we rather use
Yes. It will correctly reconnect until
We still need to add the same mechanic for the event handlers, but service registration is supposed to work properly now.
@kobergj sorry to ask again, but does the health endpoint switch to not healthy during those reconnect attempts?
No - we have no control over that. Reconnects are done by the
But I would not recommend that. If we implement something like that and NATS goes down, it would take all services with it. I would say it is better if just NATS dies and the other services reconnect when it is back.
I'd really like to have a way to see the health of an oCIS deployment WITHOUT looking at the logs. This is not possible with the current approach. Currently an oCIS installation on Kubernetes is a black box to me and I only know it's working once I have logged in, uploaded and downloaded a file, created shares, ....
@wkloucek Let me try to formulate what we decided together with you some time ago.

Service Health
We define a service as healthy when the service itself and its core functions are available.

Dependencies
A lot of services have dependencies, e.g. NATS or the reva gateway or the S3 connection. We do not indicate an unhealthy service when one of its dependencies is not working. Our understanding was that if we did, the whole service mesh could show as not healthy even though only one dependency needs to be fixed / restarted.

This case with the NATS connection
In this case, the ocis service is healthy and can work as soon as the NATS service becomes healthy again.
But even if failing healthchecks are considered bad in this case, we should still reflect unmet dependencies in the readiness status.
This is not contradictory for me. oCIS services can do reconnects and at the same time report themselves as unhealthy or not ready. A supervisor like Kubernetes would let this service run for some time and only restart it after x failing healthchecks. (Basically this is what oCIS does right now: reconnect x times and then exit with status code 1.)
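A minimal sketch of what that split could look like in Go, assuming hypothetical /healthz and /readyz endpoints on the debug server (this is not the current oCIS implementation):

```go
package main

import (
	"net/http"

	"github.com/nats-io/nats.go"
)

// healthHandlers sketches the split discussed above: /healthz reports only
// process liveness, while /readyz reflects the NATS dependency so a
// supervisor can take the pod out of rotation without restarting it.
// The paths and wiring are illustrative only.
func healthHandlers(nc *nats.Conn) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // alive as long as the process runs
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if nc == nil || !nc.IsConnected() {
			http.Error(w, "nats not connected", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}
```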
Ok I see. So what would be the ideal behaviour?
For the single binary it is convenient that ocis exits when the reconnect fails.
We can make this work for the single process only.
For Kubernetes I think it would be desirable to:
If you do the "our problem" vs. "someone else's problem" split, you also avoid this:
As a drastic example:
Seems not to be fixed.
@d7oc shared this log:
Restarting all pods multiple times fixed it in the end. Please note that this looks like a cache implementation and not the service registry.
The NATS service was still there and I didn't find evidence to believe that it was broken. In the end I just restarted it anyway alongside the other pods.
I'd propose:
So far I'd be keen to close this issue since we recycled an issue that is similar but not the same.
Would agree here as I also don't see how we can work on this issue without further traces. So as long as there is nothing directly known or visible in the code which might have caused this, I would also vote to close this issue again and create a new one on its next appearance.
If restarting pods fixes it, I assume we are suffering from the completely broken nats-js-kv implementation in stable5. We need to backport these: #8589 (comment)
Could #9048 help? I assume that if the service registration fails, the "regular" server stops, so the PR would guarantee that the debug server stops at some point too, causing the healthcheck to fail. The PR needs an update, and it isn't fully finished because it needs changes in reva, but maybe it can help for some services.
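A rough sketch of the mechanism that PR seems to aim for, assuming the two servers are tied together with an errgroup (hypothetical wiring, not the actual PR):

```go
package main

import (
	"context"
	"net/http"

	"golang.org/x/sync/errgroup"
)

// run sketches the idea of coupling both servers: if the "regular" server
// returns (e.g. because service registration failed for good), the group
// context is cancelled and the debug server serving /healthz is shut down
// too, so probes start failing. Server construction is left out.
func run(ctx context.Context, srv, debug *http.Server) error {
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error { return srv.ListenAndServe() })
	g.Go(func() error { return debug.ListenAndServe() })
	g.Go(func() error {
		<-ctx.Done() // first server error cancels the context
		_ = srv.Shutdown(context.Background())
		return debug.Shutdown(context.Background())
	})

	return g.Wait()
}
```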
I found an interesting discussion on health and readiness checks with insights from Kubernetes devs and operators: https://www.reddit.com/r/kubernetes/comments/wayj42/what_should_readiness_liveness_probe_actually/
related: #6774
Moved to blocked because @dragonchaser and @fschade will look into #9821 first
It is not blocked, I am not involved with k6 |
Describe the bug
For some reason, all services can't do their service registration:
The only service that actually fails and restarts is the appprovider. All other services pretend to be healthy and stay running.
Steps to reproduce
I have no reproducer to get into that state. (Actually, recreating all oCIS pods got me out of this state.)
Expected behavior
No service returns a positive health status when the service registration doesn't work.
Actual behavior
All pods are marked as healthy.
Setup
I have oCIS 5.0.0 installed via the oCIS Helm Chart (owncloud/ocis-charts@fe5697d) with an external NATS cluster as service registry, store and cache.
Additional context
The oCIS chart uses the /healthz endpoint to determine the status of an oCIS pod: https://github.com/owncloud/ocis-charts/blob/fe5697d0a8a0431d7b0b39e4928beb66f2276baf/charts/ocis/templates/_common/_tplvalues.tpl#L207-L220
for some context: