
[Test][Autoscaler] Add an E2E test for updating maxReplicas on a worker group #3623


Open: wants to merge 6 commits into master

Conversation

@machichima (Contributor) commented May 17, 2025

  • Check that the cluster can scale workers up and down when the maxReplicas value changes
  • Launch an actor that continuously submits tasks to simulate a user workload, triggering scale-up when maxReplicas increases

Why are these changes needed?

Add an E2E test to ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go that verifies the autoscaler can scale the nodes in a worker group up and down to the new maxReplicas value when a user updates it during the test.
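The heart of the test is a get-modify-update of the RayCluster custom resource, which the Go snippet quoted below performs with the generated client. Purely as a language-neutral illustration of that flow (not the test's actual code), the same update via the Kubernetes Python client would look roughly like this; the helper name and arguments are placeholders:

from kubernetes import client, config

def update_max_replicas(namespace: str, name: str, new_max: int) -> None:
    # Placeholder helper: fetch the RayCluster custom resource, bump
    # maxReplicas on the first worker group, and write it back.
    config.load_kube_config()
    api = client.CustomObjectsApi()
    raycluster = api.get_namespaced_custom_object(
        group="ray.io", version="v1", plural="rayclusters",
        namespace=namespace, name=name,
    )
    raycluster["spec"]["workerGroupSpecs"][0]["maxReplicas"] = new_max
    api.replace_namespaced_custom_object(
        group="ray.io", version="v1", plural="rayclusters",
        namespace=namespace, name=name, body=raycluster,
    )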

Related issue number

Closes #3616

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@machichima (Contributor, Author) commented:

@rueian PTAL, Thanks!

Comment on lines 422 to 431
// Update maxReplicas
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Get(test.Ctx(), rayCluster.Name, metav1.GetOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())
rayCluster.Spec.WorkerGroupSpecs[0].MaxReplicas = ptr.To(rtc.updatedMax)
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Update(test.Ctx(), rayCluster, metav1.UpdateOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())

// Trigger autoscaling with actors
headPod, err := GetHeadPod(test, rayCluster)
g.Expect(err).NotTo(gomega.HaveOccurred())
Contributor commented:

In this test, we should:

  1. Launch the workload.
  2. Verify that the cluster has the initial maximum number of workers.
  3. Update the maximum number of workers.
  4. Verify that the cluster has the expected number of replicas.

We may need to use normal tasks as the testing workload instead of using actors in this test. Actors will prevent the clusters from scaling down.
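As an illustration of that suggestion, a plain-task workload can be very small. In the sketch below, cpu_task and the batch size are made-up placeholders rather than code from this PR; each pending task reserves one CPU, which drives scale-up, and because tasks finish and release their CPUs the cluster can scale back down afterwards.

import ray

ray.init()

@ray.remote(num_cpus=1)
def cpu_task():
    # Each task holds one CPU while it runs; a batch larger than current
    # capacity makes the autoscaler add workers (up to maxReplicas).
    return 1

# Block until the batch finishes; afterwards no resources are held,
# so the autoscaler is free to scale the worker group back down.
ray.get([cpu_task.remote() for _ in range(10)])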

@machichima (Contributor, Author) commented May 18, 2025:

Fixed!
Because submitting normal tasks from a script run with ExecPodCmd blocks until all the tasks finish, I instead create one detached actor that keeps submitting tasks, which keeps maxReplicas workers running.
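A rough sketch of that pattern, with illustrative names and batch size (the actual script in this PR may differ):

import ray

@ray.remote(num_cpus=1)
def cpu_task():
    return 1

@ray.remote
class TaskSubmitter:
    # Keeps a batch of CPU-bound tasks in flight so the worker group
    # stays scaled out until the test deletes the cluster.
    def submit_tasks(self, num_tasks: int):
        while True:
            ray.get([cpu_task.remote() for _ in range(num_tasks)])

ray.init(namespace="default_namespace")
submitter = TaskSubmitter.options(name="submitter", lifetime="detached").remote()
# Fire and forget: the detached actor outlives this driver, so the
# ExecPodCmd invocation returns without waiting for the tasks.
submitter.submit_tasks.remote(10)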

Commit: "Use this approach to enable scale up/down while modifying maxReplicas value"
Signed-off-by: machichima <[email protected]>
@machichima requested a review from rueian May 18, 2025 06:28
Comment on lines 35 to 36
def stop(self):
    self.running = False
@rueian (Contributor) commented May 18, 2025:

Suggested change (remove these lines):
def stop(self):
    self.running = False

I think submit_tasks can't be stopped by this unless we use https://docs.ray.io/en/latest/ray-core/actors/async_api.html.
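For reference, the async-actor pattern from that doc would look roughly like the sketch below (not code from this PR). Because an async actor's methods interleave on a single asyncio event loop, a stop() call can take effect while submit_tasks is still looping:

import ray

@ray.remote(num_cpus=1)
def cpu_task():
    return 1

@ray.remote
class AsyncTaskSubmitter:
    def __init__(self):
        self.running = True

    async def submit_tasks(self, num_tasks: int):
        while self.running:
            refs = [cpu_task.remote() for _ in range(num_tasks)]
            for ref in refs:
                await ref  # ObjectRefs are awaitable inside async actor methods

    async def stop(self):
        self.running = False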

However, I think we can just remove self.running entirely here because we will delete the cluster directly at the end of tests.

@machichima (Contributor, Author) commented:

Fixed. Thanks!

while self.running:
    futures = [task.remote() for _ in range(num_tasks)]
    ray.get(futures)  # wait for current batch to complete before next batch
    time.sleep(0.1)
Contributor commented:

Suggested change (remove this line):
    time.sleep(0.1)

Not necessary.

@machichima (Contributor, Author) commented:

Fixed. Thanks!

@machichima requested a review from rueian May 18, 2025 09:04
@rueian (Contributor) left a comment:

LGTM

@rueian (Contributor) commented May 19, 2025:

cc @kevin85421 for review

@rueian requested a review from kevin85421 May 19, 2025 16:47
@kevin85421 (Member) left a comment:

This test seems to be unnecessarily complex. How about creating only detached actors, without submitting actor tasks?


parser = argparse.ArgumentParser()
parser.add_argument("--num-cpus", type=float, default=1)
parser.add_argument("--num-gpus", type=float, default=0)
Member commented:

My philosophy is to add something only when it is actually used. For example, --num-gpus is not used here, so I would remove it.
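Putting both points together (detached actors only, and no unused --num-gpus flag), the workload script could be trimmed to something like the sketch below. The --name argument and the HoldResources actor are illustrative assumptions, not necessarily what this PR ends up with:

import argparse
import ray

@ray.remote
class HoldResources:
    # Does no work; it only reserves CPUs so the autoscaler must keep a
    # worker for it, and killing the actor later frees them again.
    def ready(self):
        return True

parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, required=True)   # assumed flag
parser.add_argument("--num-cpus", type=float, default=1)
args = parser.parse_args()

ray.init(namespace="default_namespace")
actor = HoldResources.options(
    name=args.name, num_cpus=args.num_cpus, lifetime="detached"
).remote()
ray.get(actor.ready.remote())  # return only after the actor is scheduled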

@machichima (Contributor, Author) commented:

> This test seems to be unnecessarily complex. How about creating only detached actors, without submitting actor tasks?

No problem! It seems the actors here will not prevent the cluster from scaling down. Just updated, thanks!
