Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RayCluster] support suspending worker groups #2663

Merged
merged 1 commit into from
Dec 27, 2024

Conversation

andrewsykim
Copy link
Collaborator

Why are these changes needed?

Support suspending worker groups in RayCluster

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim
Copy link
Collaborator Author

(will clean up implementation and add tests once we agree on the API)

@@ -776,6 +776,22 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
return err
}

// Delete all workers if worker group is suspended
if len(workerPods.Items) > 0 && worker.Suspend != nil && *worker.Suspend {
for _, workerPod := range workerPods.Items {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use DeleteAllOf instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeleteAllOf seems better, I didn't use it because of r.rayClusterScaleExpectation.ExpectScalePod. Do you know if it's okay to skip this call in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we don't call this for spec.suspend so seems fine?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use r.deleteAllPods instead, which calls DeleteAllOf

@andrewsykim
Copy link
Collaborator Author

Added tests, PTAL

@andrewsykim andrewsykim changed the title [WIP][RayCluster] support suspending worker groups [RayCluster] support suspending worker groups Dec 18, 2024
@@ -776,6 +776,18 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
return err
}

// Delete all workers if worker group is suspended and skip reconcile
if worker.Suspend != nil && *worker.Suspend {
if _, err := r.deleteAllPods(ctx, common.RayClusterGroupPodsAssociationOptions(instance, worker.GroupName)); err != nil {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleteAllPods checks existing pods first, so this doesn't always call DeleteAllFor

@@ -776,6 +776,18 @@ func (r *RayClusterReconciler) reconcilePods(ctx context.Context, instance *rayv
return err
}

// Delete all workers if worker group is suspended and skip reconcile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a follow up issue: should we call deleteAllPods only when there are no in-flight requests? If so, I will add a follow up item in #2566.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: the expectation has already been checked by the following logic

if !r.rayClusterScaleExpectation.IsSatisfied(ctx, instance.Namespace, instance.Name, worker.GroupName) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe another follow up is to update expectations for deleteAllPods instead. Currently, we rely on deleteAllPods to check the number of Pods is more than 0 to avoid calling DeleteAllOf every reconciliation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't sure what to do about expectations, but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it doesn't seem like we do anything for expectations in the standard suspend case with .spec.suspend

We need to address it in the future. You can check #2566 for more details, but I think it is fine not to handle expectations in this PR.

r.Recorder.Eventf(instance, corev1.EventTypeWarning, string(utils.FailedToDeleteWorkerPod),
"Failed deleting worker Pods for suspended group %s in RayCluster %s/%s, %v", worker.GroupName, instance.Namespace, instance.Name, err)
return errstd.Join(utils.ErrFailedDeleteWorkerPod, err)
}
Copy link
Member

@kevin85421 kevin85421 Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make the suspend operation atomic, like in #1798?

If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go to state that this field is not user-facing and add a check in validateRayClusterSpec to ensure that workerGroup.suspend is not false—only nil and true are valid. This is not ideal, but may be able to avoid most misconfig.

Does DeleteAllOf guarantee to delete all or nothing? Is it possible that only some Pods are deleted while others fail to be deleted? If DeleteAllOf guarantee to delete all or nothing, we don't need to worry about the atomicity.

Copy link
Collaborator Author

@andrewsykim andrewsykim Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think DeleteAllOf is atomic. Not sure if the same approach we took for RayJob will work, this means we'll need a condition per worker group to track whether the suspension is completed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we expecting this field to be only for the RayJob controller and not user-facing, at least for v1.3.0? If so, we can use:

If this field is only used by the RayJob controller for now, it is fine for it not to be atomic at this moment. In that case, we should add a comment in raycluster_types.go to state that this field is not user-facing and add a check in validateRayClusterSpec to ensure that workerGroup.suspend is not false—only nil and true are valid. This is not ideal, but may be able to avoid most misconfig.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main use-case will be with RayJob, but I think it should be valid for other use-cases. We shouldn't stop or prevent users from using this field if they want to

@kevin85421
Copy link
Member

would you mind rebasing with the master branch to include this fix #2664?

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up:

  1. Add warning to this API until it is atomic.
  2. Add a new event type.
  3. Make it atomic.

@kevin85421 kevin85421 merged commit a788963 into ray-project:master Dec 27, 2024
23 checks passed
rueian added a commit to rueian/ray that referenced this pull request Jan 10, 2025
Resolves ray-project/kuberay#2666.

ray-project/kuberay#2663 adds a new `suspend` field to the
KubeRay worker group spec for suspending worker groups. A suspended worker group should
be scaled to 0 and never be scaled up until the group is resumed.

Since there is no similar functionality in the `available_node_types` definition,
the best way to let the autoscaler know a worker group has been suspended is to
set its max_workers to 0 as well as its min_workers.

This PR makes the KubeRay autoscaling config producer produce a config with
both max_workers and min_workers set to 0 if the worker group has been suspended.
The autoscaler will periodically take the config and do its work.

Signed-off-by: Rueian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants