[Hotfix][Bug] Avoid unnecessary zero-downtime upgrade #1581
Conversation
```diff
 rayServiceClusterStatus.DashboardStatus.LastUpdateTime = &timeNow
 rayServiceClusterStatus.DashboardStatus.IsHealthy = isHealthy
-if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || isHealthy {
+if rayServiceClusterStatus.DashboardStatus.HealthLastUpdateTime.IsZero() || oldIsHealthy {
```
I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

> KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps

It sounds like this is the real issue. Why don't we just update `HealthLastUpdateTime` every time? Wouldn't that fix the issue?
> It sounds like this is the real issue. Why don't we just update `HealthLastUpdateTime` every time?

That would overload the Kubernetes API server.
> I don't understand the purpose of this change and how it fixes the first issue in the PR. Can you explain this more?

By definition, `healthLastUpdateTime: $t` and `isHealthy: true` mean that the dashboard agent has been healthy from time `$t` until now. Hence, the last second of being healthy should be the first moment we consider the dashboard agent unhealthy.

Note that `healthLastUpdateTime` currently does not strictly follow the definition above due to the check of head Pod status.
Thanks, we discussed this more offline and it makes sense. The main point I was missing is that when the health state transitions, the status update actually goes through: there's a binary field `isHealthy`, which is not a timestamp, so if it changes the status actually updates. Also, to simplify the issue: the unnecessary zero-downtime upgrade bug happens any time the agent stays healthy for >600s and then transitions to unhealthy; it's not specifically related to a head Pod restarting.

@kevin85421 One last question. With the new approach, if it's unhealthy for a while, then the first time it becomes healthy we don't actually update `HealthLastUpdateTime`; instead, we update it in the next iteration. Is this your understanding as well? I think this off-by-one issue is still okay for a hotfix.
Looks good to me as a hotfix.

For the long term: the core issue (we need to track how long a resource has been healthy, but we can't update the last-healthy time in the status on every reconcile because that overloads the Kubernetes API server) is pretty bad, but it also sounds like a fairly general Kubernetes problem. @hongchaodeng do you know if there's a recommended way of dealing with this?

Never mind. We can still implement the correct logic for …
Why are these changes needed?

The RayService health check mechanism is not well-defined. This PR is a hotfix for v1.0.0, and I will revisit the mechanism after the release. Hence, I did not spend much time writing tests for this PR.
Change 1: `updateAndCheckDashboardStatus`

Question: KubeRay will not update the status of the RayService custom resource when the only changes are related to timestamps. Without this PR, deleting the head Pod of a GCS FT-enabled RayCluster could potentially trigger an unnecessary zero-downtime upgrade immediately after the head Pod restarts.

With this PR, a zero-downtime upgrade will only be triggered when the dashboard agent has been unhealthy for more than `deploymentUnhealthySecondThreshold` seconds.

Change 2: `reconcileServe`

… `deploymentUnhealthySecondThreshold` seconds. The Ray head will fail and restart after the liveness probes fail 120 times consecutively (~600 seconds). … `deploymentUnhealthySecondThreshold` seconds.

Related issue number

Closes #1565
Checks

… `deploymentUnhealthySecondThreshold` seconds. => No zero-downtime upgrade should be triggered. … `deploymentUnhealthySecondThreshold` seconds.