[RayService] Refactor to Rely More on RayService Status in RayService E2E Tests #1928


@Yicheng-Lu-llll Yicheng-Lu-llll commented Feb 18, 2024

Why are these changes needed?

After this PR:

  • RayServiceAddCREvent uses serviceStatus and numServeEndpoints to check whether it has converged. This addresses the issue described in [RayService][HA] Fix flaky tests #1823 (comment). Checking that serviceStatus is "Running" ensures the serve applications in the RayCluster are ready to handle incoming traffic, and verifying numServeEndpoints > 0 confirms that the Kubernetes serve service is ready to redirect traffic to the RayCluster.

  • RayServiceUpdateCREvent also uses serviceStatus and numServeEndpoints to check whether it has converged. This addresses Remove Extra Delay for Zero Downtime Rollout Test #1838, because numServeEndpoints > 0 can indicate whether the serve service is ready to redirect traffic.

  • RayServiceDeleteCREvent checks for convergence based on the existence of RayService and RayCluster objects. This is similar to how RayJob e2e tests check for deletion.
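
The convergence checks described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this PR: the function names (`rayservice_is_ready`, `deletion_is_complete`, `wait_until`) are hypothetical, and it assumes the RayService custom resource has already been fetched as a plain dict with `serviceStatus` and `numServeEndpoints` directly under `status`.

```python
import time
from typing import Any, Callable, Mapping, Optional, Sequence


def rayservice_is_ready(rayservice: Mapping[str, Any]) -> bool:
    """Add/update convergence: serve apps are running and the serve service has endpoints.

    Assumption: both fields sit directly under `status` in the fetched object.
    """
    status = rayservice.get("status") or {}
    # serviceStatus == "Running": serve applications can handle incoming traffic.
    serve_apps_ready = status.get("serviceStatus") == "Running"
    # numServeEndpoints > 0: the serve service has somewhere to redirect traffic.
    endpoints_ready = (status.get("numServeEndpoints") or 0) > 0
    return serve_apps_ready and endpoints_ready


def deletion_is_complete(
    rayservice: Optional[Mapping[str, Any]],
    rayclusters: Sequence[Any],
) -> bool:
    """Delete convergence: the RayService is gone and no RayCluster objects remain."""
    return rayservice is None and len(rayclusters) == 0


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `predicate` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

In a test, the predicate passed to `wait_until` would re-fetch the RayService on each call (for example through a Kubernetes client) and hand the result to `rayservice_is_ready`; how the object is fetched is left out here on purpose.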

Related issue number

Closes #1838

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@kevin85421 kevin85421 self-assigned this Feb 21, 2024
@kevin85421 kevin85421 self-requested a review February 21, 2024 18:58
(2) All head pods and worker pods are "Running".
(3) Service named "rayservice-sample-serve" presents
(1) serviceStatus is "Running", This means serve applications in Raycluster are ready to serve incoming traffic.
(2) numServeEndpoints > 0, This means the k8s serve service is ready to redirect traffic to the Raycluster.

Suggested change
(2) numServeEndpoints > 0, This means the k8s serve service is ready to redirect traffic to the Raycluster.
(2) numServeEndpoints > 0: This means the k8s serve service is ready to redirect traffic to the RayCluster.

(1) The number of head pods and worker pods are as expected.
(2) All head pods and worker pods are "Running".
(3) Service named "rayservice-sample-serve" presents
(1) serviceStatus is "Running", This means serve applications in Raycluster are ready to serve incoming traffic.

Suggested change
(1) serviceStatus is "Running", This means serve applications in Raycluster are ready to serve incoming traffic.
(1) serviceStatus is "Running": This means serve applications in RayCluster are ready to serve incoming traffic.

Signed-off-by: Yicheng-Lu-llll <[email protected]>
@kevin85421 kevin85421 merged commit e616dc4 into ray-project:master Feb 21, 2024
23 checks passed
Successfully merging this pull request may close these issues.

Remove Extra Delay for Zero Downtime Rollout Test
2 participants