[Hotfix][Bug] Avoid unnecessary zero-downtime upgrade #1581
Conversation
```diff
 rayServiceClusterStatus.DashboardStatus.LastUpdateTime = &timeNow
 rayServiceClusterStatus.DashboardStatus.IsHealthy = isHealthy
-if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || isHealthy {
+if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || oldIsHealthy {
```
I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

> KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps

It sounds like this is the real issue. Why don't we just update `HealthLastUpdateTime` every time? Wouldn't that fix the issue?
> It sounds like this is the real issue. Why don't we just update `HealthLastUpdateTime` every time?

That would overload the Kubernetes API server.
> I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

By definition, `healthLastUpdateTime: $t` and `isHealthy: true` mean that the dashboard agent has been healthy from time `$t` until now. Hence, the last second of being healthy should be the first moment we consider the dashboard agent unhealthy.

Note that `healthLastUpdateTime` currently does not strictly follow the definition above due to the check of head Pod status.
Thanks, we discussed this more offline and it makes sense. The main point I was missing is that when the health state transitions, the status update actually goes through: there's a binary field `isHealthy`, which is not a timestamp, so if it changes the status actually updates. Also, to simplify the issue: the unnecessary zero-downtime upgrade bug happens any time the agent stays healthy for >600s and then transitions to unhealthy; it's not specifically related to a head Pod restarting.

@kevin85421 One last question. With the new approach, if it's unhealthy for a while, then the first time it becomes healthy we don't actually update `HealthLastUpdateTime`; instead, we update it in the next iteration. Is this your understanding as well? I think this off-by-one issue is still okay for a hotfix.
Looks good to me as a hotfix.

For the long term: the core issue (we need to track how long a resource has been healthy, but we can't update the last-healthy time in the status on every reconcile because that overloads the Kubernetes API server) is pretty bad, but it also sounds like a fairly general Kubernetes problem. @hongchaodeng do you know if there's a recommended way of dealing with this?

Never mind. We can still implement the correct logic for …
Why are these changes needed?

The RayService health check mechanism is not well-defined. This PR is a hotfix for v1.0.0, and I will revisit the mechanism after the release. Hence, I did not spend much time writing tests for this PR.
Change 1: `updateAndCheckDashboardStatus`

Question: KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps. Without this PR, deleting the head Pod of a GCS FT-enabled RayCluster could potentially trigger an unnecessary zero-downtime upgrade immediately after the head Pod restarts.

With this PR, a zero-downtime upgrade will only be triggered when the dashboard agent has been unhealthy for more than `deploymentUnhealthySecondThreshold` seconds.

Change 2: `reconcileServe`

… `deploymentUnhealthySecondThreshold` seconds. The Ray head will fail and restart after the liveness probes fail 120 times consecutively (~600 seconds). … `deploymentUnhealthySecondThreshold` seconds.

Related issue number

Closes #1565
Checks

… `deploymentUnhealthySecondThreshold` seconds. => No zero-downtime upgrade should be triggered. … `deploymentUnhealthySecondThreshold` seconds.