-
Notifications
You must be signed in to change notification settings - Fork 436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RayCluster] support suspending worker groups #2663
Conversation
6f5dbd0
to
c37f878
Compare
c37f878
to
a0206e2
Compare
(will clean up implementation and add tests once we agree on the API) |
@@ -776,6 +776,22 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv | |||
return err | |||
} | |||
|
|||
// Delete all workers if worker group is suspended | |||
if len(workerPods.Items) > 0 && worker.Suspend != nil && *worker.Suspend { | |||
for _, workerPod := range workerPods.Items { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use DeleteAllOf
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DeleteAllOf seems better, I didn't use it because of r.rayClusterScaleExpectation.ExpectScalePod
. Do you know if it's okay to skip this call in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we don't call this for spec.suspend
so seems fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use r.deleteAllPods
instead, which calls DeleteAllOf
a0206e2
to
d008de0
Compare
Added tests, PTAL |
d008de0
to
afb2799
Compare
@@ -776,6 +776,18 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv | |||
return err | |||
} | |||
|
|||
// Delete all workers if worker group is suspended and skip reconcile | |||
if worker.Suspend != nil && *worker.Suspend { | |||
if _, err := r.deleteAllPods(ctx, common.RayClusterGroupPodsAssociationOptions(instance, worker.GroupName)); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleteAllPods
checks existing pods first, so this doesn't always call DeleteAllFor
@@ -776,6 +776,18 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv | |||
return err | |||
} | |||
|
|||
// Delete all workers if worker group is suspended and skip reconcile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a follow up issue: should we call deleteAllPods
only when there are no in-flight requests? If so, I will add a follow up item in #2566.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: the expectation has already been checked by the following logic
if !r.rayClusterScaleExpectation.IsSatisfied(ctx, instance.Namespace, instance.Name, worker.GroupName) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe another follow up is to update expectations
for deleteAllPods
instead. Currently, we rely on deleteAllPods
to check the number of Pods is more than 0 to avoid calling DeleteAllOf
every reconciliation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I wasn't sure what to do about expectations, but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend
We need to address it in the future. You can check #2566 for more details, but I think it is fine not to handle expectations in this PR.
r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.FailedToDeleteWorkerPod), | ||
"Failed deleting worker Pods for suspended group %s in RayCluster %s/%s, %v", worker.GroupName, instance.Namespace, instance.Name, err) | ||
return errstd.Join(utils.ErrFailedDeleteWorkerPod, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the suspend operation atomic, like in #1798?
If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go
to state that this field is not user-facing and add a check in validateRayClusterSpec
to ensure that workerGroup.suspend
is not false
—only nil
and true
are valid. This is not ideal, but may be able to avoid most misconfig.
Does DeleteAllOf
guarantee to delete all or nothing? Is it possible that only some Pods are deleted while others fail to be deleted? If DeleteAllOf
guarantee to delete all or nothing, we don't need to worry about the atomicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think DeleteAllOf is atomic. Not sure if the same approach we took for RayJob will work, this means we'll need a condition per worker group to track whether the suspension is completed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we expecting this field to be only for the RayJob controller and not user-facing, at least for v1.3.0? If so, we can use:
If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go to state that this field is not user-facing and add a check in validateRayClusterSpec to ensure that workerGroup.suspend is not false—only nil and true are valid. This is not ideal, but may be able to avoid most misconfig.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main use-case will be with RayJob, but I think it should be valid for other use-cases. We shouldn't stop or prevent users from using this field if they want to
would you mind rebasing with the master branch to include this fix #2664? |
Signed-off-by: Andrew Sy Kim <[email protected]>
afb2799
to
1c61b3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up:
- Add warning to this API until it is atomic.
- Add a new event type.
- Make it atomic.
Resolves ray-project/kuberay#2666. ray-project/kuberay#2663 adds a new `suspend` field to the KubeRay worker group spec for suspending worker groups. A suspended worker group should be scaled to 0 and never be scaled up until the group is resumed. Since there is no similar functionality in the `available_node_types` definition, the best way to let the autoscaler know a worker group has been suspended is to set its max_workers to 0 as well as its min_workers. This PR makes the KubeRay autoscaling config producer produce a config with both max_workers and min_workers set to 0 if the worker group has been suspended. The autoscaler will periodically take the config and do its work. Signed-off-by: Rueian <[email protected]>
Why are these changes needed?
Support suspending worker groups in RayCluster
Related issue number
Checks