-
Notifications
You must be signed in to change notification settings - Fork 538
[RayService] don't update serveConfigV2 in current ray cluster if ray… #3559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[RayService] don't update serveConfigV2 in current ray cluster if ray… #3559
Conversation
This is the screenshot of reproduction. After applying the fix: (the script periodically check if the new serveConfigV2 has been updated or not until the ray cluster has been terminated.) Attach the reproduction script for your reference. |
@kevin85421 PTAL |
… cluster spec has changed at the same time. Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
e54f5ee
to
c02c81f
Compare
if err != nil && | ||
(strings.Contains(err.Error(), "exit code 52") || | ||
strings.Contains(err.Error(), "exit code 6")) { | ||
// which is "curl: (52) Empty reply from server" or "curl: (6) Couldn't resolve host" | ||
// it implies the ray cluster is terminated. | ||
break | ||
} | ||
|
||
// Except the errors above, it should not occur. | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test condition is not that straight forward. It would keep looping because it should check serveConfigV2 is not modified until it terminated.
There are a few of explanations about these conditions.
- check the request to dashboard is unreachable, which implies the ray cluster terminated. it will exit the loop. (code 52 or 6 might happen). This is test pass if the following condition is not happened.
- check the serveConfigV2 is not contain updated price until the ray cluster terminated. If it contains the updated price, test fails immediately.
Hi @kevin85421 , the test is ready. PTAL |
cc @rueian would you mind reviewing this PR? Thanks. |
// Except the errors above, it should not occur. | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me, but I am afraid that this g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster")
could never hit due to timing races. Could we simply wait for the old cluster to be destroyed and verify we don't reconcileServe
on it by checking there are no new UpdatedServeApplications
or FailedToUpdateServeApplications
events on the cluster?
@@ -82,3 +86,97 @@ func TestRayServiceInPlaceUpdate(t *testing.T) { | |||
g.Expect(stdout.String()).To(Equal("9 pizzas please!")) | |||
}, TestTimeoutShort).Should(Succeed()) | |||
} | |||
|
|||
func TestRayServiceInPlaceUpdateWithRayClusterSpec(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have another test for zero time upgrade enabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like enabled already.
There is no upgradeStrategy
in RayServiceSampleYamlApplyConfiguration()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, then can we have a test for zero downtime upgrade being disabled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit of confusing about what it should do if zero downtime upgrade is disabled.
If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster. Which might be an potential issue that this pr is trying to solve. Because ShouldPrepareNewCluster(...) wouldn't be true under these conditions.
However, if serveConfigV2 is not allow to update on the existed cluster under zero downtime upgrade is disabled with RayClusterSpec and serveConfigV2 modified at the same time. It might violate the design https://docs.ray.io/en/master/cluster/kubernetes/user-guides/rayservice.html#step-7-in-place-update-for-ray-serve-applications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is disabled, it seems the new RayCluster wouldn't be created. The serveConfigV2 will be update on the existed cluster.
I think that is the correct behavior. Is there any existing test guarantee that already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems no such test so add a test without zero downtime upgrade.
…lications happen or not Signed-off-by: fscnick <[email protected]>
g.Eventually(func() string { | ||
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | ||
if err == nil { | ||
return "" | ||
} | ||
return err.Error() | ||
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
g.Eventually(func() string { | |
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | |
if err == nil { | |
return "" | |
} | |
return err.Error() | |
}, TestTimeoutMedium).Should(ContainSubstring(fmt.Sprintf("rayclusters.ray.io \"%s\" not found", activeRayClusterBeforeUpdate.Name))) | |
g.Eventually(func() bool { | |
_, err := GetRayCluster(test, activeRayClusterBeforeUpdate.Namespace, activeRayClusterBeforeUpdate.Name) | |
return k8sApiErrors.IsNotFound(err) | |
}, TestTimeoutMedium).Should(BeTrue()) |
|
||
// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update. | ||
for _, event := range events { | ||
if event.CreationTimestamp.Before(updateTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we test the LastTimestamp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LastTimestamp
might be later than CreationTimestamp
which might need to consider more situation. It would be simple to choose CreationTimestamp
. Kindly let me know if there are any other concern.
CreationTimestamp
definition:
https://github.com/kubernetes/apimachinery/blob/202cba0f14e50d9b4106b34623f8c8056db3f86e/pkg/apis/meta/v1/types.go#L179-L188
FirstTimestamp
and LastTimestamp
definition:
https://github.com/kubernetes/api/blob/f7e72be095ee5a3e60d9df2f79b6591ee329734a/core/v1/types.go#L7205-L7211
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.
How about you first check if the reason is what we want to test, and then test the LastTimestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the LastTimestamp should record the event we update to the new cluster. How about test the Count = 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it is okay to test the Count == 2. It seems message are excluded from aggregation key so that it is not knowing that belong to which RayCluster. However, if it happens more than 2 or less than 2, it definitely goes wrong. But it seems no other information in the event to tell which is which.
@@ -90,6 +90,22 @@ func CurlRayServicePod( | |||
return ExecPodCmd(t, curlPod, curlPodContainerName, cmd) | |||
} | |||
|
|||
func CurlRayClusterDashboard( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this and the ExecPodCmdReturnError can be removed.
ray-operator/test/support/events.go
Outdated
@@ -128,3 +131,15 @@ func getWhitespaceStr(size int) string { | |||
} | |||
return whiteSpaceStr | |||
} | |||
|
|||
func RayServiceEvents(t Test, rayService *rayv1.RayService) ([]corev1.Event, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This by itself seems too small to be a helper.
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
|
||
// event UpdatedServeApplications or FailedToUpdateServeApplications should not occur on activeRayClusterBeforeUpdate after RayService update. | ||
for _, event := range events { | ||
if event.CreationTimestamp.Before(updateTimestamp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, LastTimestamp could be later than CreationTimestamp because k8s will aggregate the same event into the old one, therefore if you skip events by the CreationTimestamp, you might skip all the events.
How about you first check if the reason is what we want to test, and then test the LastTimestamp?
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
updateTimestamp := rayService.Status.LastUpdateTime | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a curl test to make sure the new serve config is updated on the new cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add it after checking the old RayCluster has terminated.
Signed-off-by: fscnick <[email protected]>
3c44f1d
to
78c45cc
Compare
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
LGTM. cc @kevin85421 for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
} | ||
count++ | ||
} | ||
g.Expect(count).To(Equal(2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also check these two events? One is for the old RayCluster, and the other is for the new RayCluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RayCluster name appears on message
which might be exclude from aggregation. It would be difficult to distinguish which is which. There are some discussion about this condition #3559 (comment). Or do I misunderstand something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, why can we check the number of events?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discuss offline:
I am OK to merge the PR if:
- Survey when will aggregation be triggered.
- It can pass 20 times consecutively in your local env.
- Check these two events are for different RayCluster CRs
- Add a comment / TODO for the possibility of aggregation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
Aggregation is performed if a similar event is recorded 10 times in a 10 minute rolling interval.
https://github.com/kubernetes/client-go/blob/46965213e4561ad1b9c585d1c3551a0cc8d3fcd6/tools/record/events_cache.go#L430-L432 -
Here is the screenshot

- Compare two RayClusters name from the event message at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aggregation is performed if a similar event is recorded 10 times in a 10 minute rolling interval.
https://github.com/kubernetes/client-go/blob/46965213e4561ad1b9c585d1c3551a0cc8d3fcd6/tools/record/events_cache.go#L430-L432
Oh, sorry. In that case, aggregation is quite likely to happen since we share the Kubernetes cluster across multiple end-to-end tests. We might need an alternative way to verify that the Ray Serve applications in the old RayCluster are not being updated.
A possible way is:
- Add a finalizer to old RayCluster
- Trigger zero downtime upgrade
- Wait until RayService switches to the new RayCluster
- Verify the old RayCluster's Serve applications are not updated.
- Remove the finalizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. PTAL
) | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
// Wait active RayCluster has been terminated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use logic similar to
LogWithTimestamp(test.T(), "Waiting for RayService %s/%s UpgradeInProgress condition to be true", newRayService.Namespace, newRayService.Name) |
Maybe we can add a utility function to wait until the RayService switches to the new RayCluster.
g.Expect(rayService).NotTo(BeNil()) | ||
|
||
// Create curl pod | ||
curlPodName := "curl-pod" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up: Update CreateCurlPod to include the logic to wait until it is running and ready. Can you open an issue to track the progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created. #3613
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
…dated Signed-off-by: fscnick <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rueian would you mind reviewing another iteration? I will do an iteration later.
finalizers := activeRayClusterBeforeUpdate.Finalizers | ||
activeRayClusterBeforeUpdate.Finalizers = activeRayClusterBeforeUpdate.Finalizers[:0] | ||
for _, finalizer := range finalizers { | ||
if finalizer == "no-existed-finalizer" { | ||
continue | ||
} | ||
activeRayClusterBeforeUpdate.Finalizers = append(activeRayClusterBeforeUpdate.Finalizers, finalizer) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finalizers := activeRayClusterBeforeUpdate.Finalizers | |
activeRayClusterBeforeUpdate.Finalizers = activeRayClusterBeforeUpdate.Finalizers[:0] | |
for _, finalizer := range finalizers { | |
if finalizer == "no-existed-finalizer" { | |
continue | |
} | |
activeRayClusterBeforeUpdate.Finalizers = append(activeRayClusterBeforeUpdate.Finalizers, finalizer) | |
} | |
activeRayClusterBeforeUpdate.Finalizers = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we just clear it?
Signed-off-by: fscnick <[email protected]>
stdout, _ = CurlRayServicePod(test, rayService, curlPod, curlContainerName, "/calc", `["MUL", 3]`) | ||
g.Expect(stdout.String()).To(Equal("15 pizzas please!")) | ||
|
||
// In-place update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not in-place update.
// In-place update | |
// Update both `serveConfigV2` and the Ray head's environment variables. | |
// Then, verify the following two conditions: | |
// (1) The new `serveConfigV2` is not applied to the old RayCluster. | |
// (2) A zero-downtime upgrade is triggered. |
activeRayClusterBeforeUpdate, err = GetRayCluster(test, namespace.Name, rayService.Status.ActiveServiceStatus.RayClusterName) | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
// apply a non-existed-finalizer to the active ray cluster to avoiding be removed before checking the serveConfigV2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// apply a non-existed-finalizer to the active ray cluster to avoiding be removed before checking the serveConfigV2 | |
// Apply a finalizer to the active RayCluster to prevent it from being deleted after the zero-downtime upgrade process is complete. |
g.Expect(stdout.String()).To(Equal("15 pizzas please!")) | ||
|
||
// In-place update | ||
// Parse ServeConfigV2 and replace the string in the simplest way to update it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Parse ServeConfigV2 and replace the string in the simplest way to update it. |
serveConfig := rayService.Spec.ServeConfigV2 | ||
serveConfig = strings.Replace(serveConfig, "price: 3", "price: 456", -1) | ||
rayService.Spec.ServeConfigV2 = serveConfig | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Make sure the old ray cluster is not updated. | ||
stdout, _, err = CurlRayClusterDashboard(test, activeRayClusterBeforeUpdate, curlPod, curlContainerName, utils.DeployPathV2) | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
g.Expect(stdout.String()).NotTo(ContainSubstring("\"price\": 456"), "new price should not be updated on the old ray cluster") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of possible cases that the stdout
doesn't contain NotTo
. It's better to query /fruit
and check the results.
@@ -6,8 +6,11 @@ import ( | |||
|
|||
. "github.com/onsi/gomega" | |||
corev1 "k8s.io/api/core/v1" | |||
k8sApiErrors "k8s.io/apimachinery/pkg/api/errors" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is k8s import convention
k8sApiErrors "k8s.io/apimachinery/pkg/api/errors" | |
k8serrors "k8s.io/apimachinery/pkg/api/errors" |
) | ||
g.Expect(err).NotTo(HaveOccurred()) | ||
|
||
// make sure the old ray cluster is removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this might take a long time. We only need to remove the finalizer. Can you measure how long L190 - L194 take? If it is long, we can remove this Eventually
logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it take elapsed time: 1m0.446727917s
out of 114.902s
. It might be a bit of long. Remove in advance. Kindly let me know if you'd like to keep it.
stdout, _ = CurlRayServicePod(test, rayService, curlPod, curlContainerName, "/calc", `["MUL", 3]`) | ||
g.Expect(stdout.String()).To(Equal("15 pizzas please!")) | ||
|
||
// In-place update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// In-place update | |
// Update both `serveConfigV2` and the Ray head's environment variables. | |
// Since `upgradeStrategy` is set to `None`, a zero-downtime upgrade will not be triggered, | |
// and the new `serveConfigV2` will be applied to the existing RayCluster. |
g.Expect(stdout.String()).To(Equal("15 pizzas please!")) | ||
|
||
// In-place update | ||
// Parse ServeConfigV2 and replace the string in the simplest way to update it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Parse ServeConfigV2 and replace the string in the simplest way to update it. |
@@ -90,6 +91,21 @@ func CurlRayServicePod( | |||
return ExecPodCmd(t, curlPod, curlPodContainerName, cmd) | |||
} | |||
|
|||
func CurlRayClusterDashboard(t Test, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to expose this function (use the uppercase)?
Signed-off-by: fscnick <[email protected]>
Signed-off-by: fscnick <[email protected]>
Why are these changes needed?
When the serveConfigV2 and RayCluster has changed at the same time, it would also update the current active ray cluster with the new serveConfigV2. It might be possible to cause a failure if something in the runtime environment is different.
Related issue number
Closes #3423
Checks