[RayService][Hotfix] Hotfix for Flaky Zero Downtime Rollout Test #1837

Merged
merged 2 commits into from
Jan 17, 2024

Yicheng-Lu-llll
Contributor

@Yicheng-Lu-llll Yicheng-Lu-llll commented Jan 17, 2024

Why are these changes needed?

The Zero Downtime Rollout Test is flaky for the following reason:

The current logic for the Zero Downtime Rollout operates as follows:

  1. Once the pending RayCluster is ready for service, the RayService Controller updates its status to 'running' and updates the selection label for the serve service.
  2. The RayService Controller then waits for 60 seconds before deleting the old RayCluster. This 60-second interval allows time for the serve service to be updated.
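The two steps above can be sketched as a small timeline model. This is only an illustration, not KubeRay code: the class, field, and method names are hypothetical, and the only value taken from the description is the 60-second wait.

```python
from dataclasses import dataclass, field
from typing import Optional

# The 60-second wait comes from the description above; everything else is illustrative.
OLD_CLUSTER_DELETION_DELAY_S = 60.0


@dataclass
class RolloutModel:
    """Toy model of the rollout steps (hypothetical names, not the controller's API)."""

    status: str = "pending"
    selector_target: str = "old-cluster"
    clusters: set = field(default_factory=lambda: {"old-cluster", "new-cluster"})
    old_cluster_delete_at: Optional[float] = None

    def on_pending_cluster_ready(self, now: float) -> None:
        # Step 1: mark the service as running and point the selector at the new cluster.
        self.status = "running"
        self.selector_target = "new-cluster"
        # Step 2: schedule deletion of the old cluster 60 seconds later.
        self.old_cluster_delete_at = now + OLD_CLUSTER_DELETION_DELAY_S

    def tick(self, now: float) -> None:
        # Delete the old cluster only once the scheduled time has been reached.
        if self.old_cluster_delete_at is not None and now >= self.old_cluster_delete_at:
            self.clusters.discard("old-cluster")
```

Note that in this model the status and the selector change at the same instant; nothing in it records when traffic actually starts reaching the new cluster.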

However, a critical issue is that the 'running' status merely indicates the readiness of the new RayCluster to serve traffic; it does not confirm that the serve service has begun redirecting traffic to the new RayCluster. As a result, requests sent after the status change to 'running' might still be responded to by the old RayCluster.
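A minimal sketch of that race, assuming the serve service needs some propagation time after the label update before it routes to the new RayCluster. The function name and the `propagation_delay_s` parameter are hypothetical and are not measured values from KubeRay.

```python
def responding_cluster(
    request_time: float,
    status_running_at: float,
    propagation_delay_s: float,
) -> str:
    """Return which cluster answers a request in this toy model.

    Assumption: the service keeps routing to the old cluster until
    `propagation_delay_s` seconds after the status became 'running'.
    """
    if request_time < status_running_at + propagation_delay_s:
        return "old"
    return "new"
```

A test that sends its request immediately after observing 'running' falls into the first branch, which is the failure mode described above.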

This can cause frequent assertion errors in the RayService CI tests, because responses expected from the new RayCluster may still be coming from the old RayCluster. Some examples can be found here:

To mitigate this until the rollout logic is refactored and the status is redefined, this PR adds a delay to ensure the serve service is ready to redirect traffic to the new RayCluster.
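One way such a delay could look on the test side is sketched below. This is not the code from this PR: the helper name, the `send_request` callable, and the default delay value are all hypothetical placeholders, and the `sleep` parameter exists only so the helper can be exercised without real waiting.

```python
import time
from typing import Callable


def assert_served_by_new_cluster(
    send_request: Callable[[], str],
    expected: str,
    settle_delay_s: float = 5.0,  # placeholder value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait a fixed delay after the status is 'running', then check the response."""
    # Give the serve service time to start routing to the new RayCluster.
    sleep(settle_delay_s)
    actual = send_request()
    assert actual == expected, f"expected {expected!r}, got {actual!r}"
```

A fixed delay is a stopgap: it reduces the chance of hitting the old RayCluster but does not prove the service has switched, which is consistent with the description treating this as a mitigation ahead of a refactor.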

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: Yicheng-Lu-llll <[email protected]>
@kevin85421 kevin85421 self-assigned this Jan 17, 2024
@kevin85421 kevin85421 self-requested a review January 17, 2024 21:08
@kevin85421 kevin85421 changed the title [RayService] Fix for Flaky Zero Downtime Rollout Test [RayService][Hotfix] Hotfix for Flaky Zero Downtime Rollout Test Jan 17, 2024
@kevin85421
Member

Some failures in this build appear to be related to this issue. @Yicheng-Lu-llll would you mind confirming it? Thanks!

Signed-off-by: Yicheng-Lu-llll <[email protected]>
@Yicheng-Lu-llll
Contributor Author

Yes, the CI failures are due to the issue mentioned here. I've included them in the sample section of my PR description. Thank you!

Member

@kevin85421 kevin85421 left a comment

I haven't succeeded in reproducing the issue on my laptop. However, I encountered it in two of the three test runs in this build. After I added the hotfix to my PR, the test passed on the first run. Although the data points are limited, this lends some support to the assumptions in this PR.

@kevin85421 kevin85421 merged commit aad2fc6 into ray-project:master Jan 17, 2024
23 of 24 checks passed