
[Test][Autoscaler] Add an E2E test for updating maxReplicas on a worker group #3623


Open: wants to merge 6 commits into master

Conversation

@machichima (Contributor) commented May 17, 2025

  • Check that the cluster can scale workers up and down when the maxReplicas value changes
  • Launch an actor that continuously submits tasks to simulate a user workload, triggering scale-up when maxReplicas increases

Why are these changes needed?

Add an E2E test to ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go that verifies the autoscaler can scale the nodes in a worker group up and down to the new maxReplicas value when a user updates it during the test.
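The heart of the test is a get-modify-update of the RayCluster custom resource, which the Go snippet quoted below performs with the generated client. Purely as a language-neutral illustration of that flow (not the test's actual code), the same update via the Kubernetes Python client would look roughly like this; the helper name and arguments are placeholders:

from kubernetes import client, config

def update_max_replicas(namespace: str, name: str, new_max: int) -> None:
    # Placeholder helper: fetch the RayCluster custom resource, bump
    # maxReplicas on the first worker group, and write it back.
    config.load_kube_config()
    api = client.CustomObjectsApi()
    raycluster = api.get_namespaced_custom_object(
        group="ray.io", version="v1", plural="rayclusters",
        namespace=namespace, name=name,
    )
    raycluster["spec"]["workerGroupSpecs"][0]["maxReplicas"] = new_max
    api.replace_namespaced_custom_object(
        group="ray.io", version="v1", plural="rayclusters",
        namespace=namespace, name=name, body=raycluster,
    )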

Related issue number

Closes #3616

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@machichima (Contributor, Author) commented:

@rueian PTAL, Thanks!

Comment on lines 422 to 431
// Update maxReplicas
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Get(test.Ctx(), rayCluster.Name, metav1.GetOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())
rayCluster.Spec.WorkerGroupSpecs[0].MaxReplicas = ptr.To(rtc.updatedMax)
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Update(test.Ctx(), rayCluster, metav1.UpdateOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())

// Trigger autoscaling with actors
headPod, err := GetHeadPod(test, rayCluster)
g.Expect(err).NotTo(gomega.HaveOccurred())
Contributor commented:

In this test, we should:

  1. Launch the workload.
  2. Verify that the cluster has the initial maximum number of workers.
  3. Update the maximum number of workers.
  4. Verify that the cluster has the expected number of replicas.

We may need to use normal tasks as the testing workload instead of using actors in this test. Actors will prevent the clusters from scaling down.
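As an illustration of that suggestion, a plain-task workload can be very small. In the sketch below, cpu_task and the batch size are made-up placeholders rather than code from this PR; each pending task reserves one CPU, which drives scale-up, and because tasks finish and release their CPUs the cluster can scale back down afterwards.

import ray

ray.init()

@ray.remote(num_cpus=1)
def cpu_task():
    # Each task holds one CPU while it runs; a batch larger than current
    # capacity makes the autoscaler add workers (up to maxReplicas).
    return 1

# Block until the batch finishes; afterwards no resources are held,
# so the autoscaler is free to scale the worker group back down.
ray.get([cpu_task.remote() for _ in range(10)])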

@machichima (Contributor, Author) commented May 18, 2025:

Fixed!
Because submitting normal tasks from a script run with ExecPodCmd blocks until all the tasks finish, I instead create one detached actor that keeps submitting tasks, which keeps maxReplicas workers running.
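A rough sketch of that pattern, with illustrative names and batch size (the actual script in this PR may differ):

import ray

@ray.remote(num_cpus=1)
def cpu_task():
    return 1

@ray.remote
class TaskSubmitter:
    # Keeps a batch of CPU-bound tasks in flight so the worker group
    # stays scaled out until the test deletes the cluster.
    def submit_tasks(self, num_tasks: int):
        while True:
            ray.get([cpu_task.remote() for _ in range(num_tasks)])

ray.init(namespace="default_namespace")
submitter = TaskSubmitter.options(name="submitter", lifetime="detached").remote()
# Fire and forget: the detached actor outlives this driver, so the
# ExecPodCmd invocation returns without waiting for the tasks.
submitter.submit_tasks.remote(10)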

Commit: "Use this approach to enable scale up/down while modifying maxReplicas value"
Signed-off-by: machichima <[email protected]>
@machichima requested a review from rueian May 18, 2025 06:28
Comment on lines 35 to 36
def stop(self):
    self.running = False
@rueian (Contributor) commented May 18, 2025:

Suggested change (remove these lines):
def stop(self):
    self.running = False

I think submit_tasks can't be stopped by this unless we use https://docs.ray.io/en/latest/ray-core/actors/async_api.html.
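For reference, the async-actor pattern from that doc would look roughly like the sketch below (not code from this PR). Because an async actor's methods interleave on a single asyncio event loop, a stop() call can take effect while submit_tasks is still looping:

import ray

@ray.remote(num_cpus=1)
def cpu_task():
    return 1

@ray.remote
class AsyncTaskSubmitter:
    def __init__(self):
        self.running = True

    async def submit_tasks(self, num_tasks: int):
        while self.running:
            refs = [cpu_task.remote() for _ in range(num_tasks)]
            for ref in refs:
                await ref  # ObjectRefs are awaitable inside async actor methods

    async def stop(self):
        self.running = False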

However, I think we can just remove self.running entirely here because we will delete the cluster directly at the end of tests.

@machichima (Contributor, Author) commented:

Fixed. Thanks!

while self.running:
    futures = [task.remote() for _ in range(num_tasks)]
    ray.get(futures)  # wait for current batch to complete before next batch
    time.sleep(0.1)
Contributor commented:

Suggested change (remove this line):
    time.sleep(0.1)

Not necessary.

@machichima (Contributor, Author) commented:

Fixed. Thanks!

@machichima requested a review from rueian May 18, 2025 09:04
@rueian (Contributor) left a comment:

LGTM

@rueian (Contributor) commented May 19, 2025:

cc @kevin85421 for review

@rueian requested a review from kevin85421 May 19, 2025 16:47
@kevin85421 (Member) left a comment:

This test seems to be unnecessarily complex. How about creating only detached actors, without submitting actor tasks?


parser = argparse.ArgumentParser()
parser.add_argument("--num-cpus", type=float, default=1)
parser.add_argument("--num-gpus", type=float, default=0)
Member commented:

My philosophy is to add something only when it is actually used. For example, --num-gpus is not used here, so I would remove it.
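Putting both points together (detached actors only, and no unused --num-gpus flag), the workload script could be trimmed to something like the sketch below. The --name argument and the HoldResources actor are illustrative assumptions, not necessarily what this PR ends up with:

import argparse
import ray

@ray.remote
class HoldResources:
    # Does no work; it only reserves CPUs so the autoscaler must keep a
    # worker for it, and killing the actor later frees them again.
    def ready(self):
        return True

parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, required=True)   # assumed flag
parser.add_argument("--num-cpus", type=float, default=1)
args = parser.parse_args()

ray.init(namespace="default_namespace")
actor = HoldResources.options(
    name=args.name, num_cpus=args.num_cpus, lifetime="detached"
).remote()
ray.get(actor.ready.remote())  # return only after the actor is scheduled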

@machichima (Contributor, Author) commented:

> This test seems to be unnecessarily complex. How about creating only detached actors, without submitting actor tasks?

No problem! It seems the actors here will not prevent the cluster from scaling down. Just updated, thanks!
