[Test][Autoscaler] Add an E2E test for updating maxReplicas on a worker group #3623
base: master
Conversation
Signed-off-by: machichima <[email protected]>
@rueian PTAL, Thanks!
// Update maxReplicas
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Get(test.Ctx(), rayCluster.Name, metav1.GetOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())
rayCluster.Spec.WorkerGroupSpecs[0].MaxReplicas = ptr.To(rtc.updatedMax)
rayCluster, err = test.Client().Ray().RayV1().RayClusters(namespace.Name).Update(test.Ctx(), rayCluster, metav1.UpdateOptions{})
g.Expect(err).NotTo(gomega.HaveOccurred())

// Trigger autoscaling with actors
headPod, err := GetHeadPod(test, rayCluster)
g.Expect(err).NotTo(gomega.HaveOccurred())
In this test, we should:
- Launch the workload.
- Verify that the cluster has the initial maximum number of workers.
- Update the maximum number of workers.
- Verify that the cluster has the expected number of replicas.
We may need to use normal tasks as the testing workload instead of actors in this test, since actors will prevent the cluster from scaling down.
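For reference, a tasks-only workload along these lines could look like the sketch below. The busy task name, CPU request, and task count are illustrative assumptions, not taken from the PR.

import time

import ray


@ray.remote(num_cpus=1)
def busy(seconds):
    # Each pending copy of this task requests one CPU, so submitting more
    # tasks than the cluster currently has CPUs for makes the autoscaler
    # scale up; once they finish, nothing pins the workers and the cluster
    # can scale back down.
    time.sleep(seconds)


if __name__ == "__main__":
    ray.init(address="auto")
    # Request more CPUs than are currently available to trigger scale-up.
    ray.get([busy.remote(30) for _ in range(8)])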
Fixed!
Since submitting normal tasks by running a script with ExecPodCmd blocks the process until all tasks finish, I instead create one detached actor that keeps submitting tasks, which keeps the number of workers at maxReplicas.
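Roughly, such a detached task-submitting actor could look like the sketch below. The TaskSubmitter and cpu_task names, the resource requests, and the fire-and-forget driver are illustrative assumptions modeled on the fragments quoted later in this thread, not the PR's actual script.

import ray


@ray.remote(num_cpus=1)
def cpu_task():
    # Occupy one CPU briefly so there is always pending demand.
    import time
    time.sleep(1)


@ray.remote(num_cpus=0)
class TaskSubmitter:
    # Detached actor that keeps a steady stream of CPU-bound tasks pending,
    # so the autoscaler keeps the worker group scaled to its current
    # maxReplicas.
    def submit_tasks(self, num_tasks):
        while True:
            futures = [cpu_task.remote() for _ in range(num_tasks)]
            ray.get(futures)  # wait for this batch before submitting the next


if __name__ == "__main__":
    ray.init(address="auto")
    submitter = TaskSubmitter.options(
        name="task_submitter", lifetime="detached"
    ).remote()
    # Fire and forget: the detached actor outlives this driver script, so the
    # ExecPodCmd call returns while tasks keep arriving in the background.
    submitter.submit_tasks.remote(4)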
Use this approach to enable scaling up/down while modifying the maxReplicas value.
Signed-off-by: machichima <[email protected]>
Signed-off-by: machichima <[email protected]>
def stop(self):
    self.running = False
I think submit_tasks can't be stopped by this unless we use https://docs.ray.io/en/latest/ray-core/actors/async_api.html. However, I think we can just remove self.running entirely here, because we delete the cluster directly at the end of the tests.
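For context, the async-actor variant the linked docs describe would look roughly like the sketch below (class and task names are illustrative). Because submit_tasks awaits between batches, a later stop() call can run on the actor's event loop and actually flip the flag; a plain synchronous actor never yields inside the loop, so stop() would never get a chance to run.

import ray


@ray.remote(num_cpus=1)
def cpu_task():
    import time
    time.sleep(1)


@ray.remote(num_cpus=0)
class AsyncSubmitter:
    def __init__(self):
        self.running = True

    async def submit_tasks(self, num_tasks):
        while self.running:
            refs = [cpu_task.remote() for _ in range(num_tasks)]
            for ref in refs:
                await ref  # yields control so other method calls can run

    async def stop(self):
        # Runs concurrently with submit_tasks on the actor's event loop.
        self.running = False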
Fixed. Thanks!
while self.running:
    futures = [task.remote() for _ in range(num_tasks)]
    ray.get(futures)  # wait for current batch to complete before next batch
    time.sleep(0.1)
The time.sleep(0.1) is not necessary.
Fixed. Thanks!
Signed-off-by: machichima <[email protected]>
LGTM
cc @kevin85421 for review
This test seems to be unnecessarily complex. How about creating detached actors only, without submitting actor tasks?
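A sketch of that simpler workload is below; the Sleeper actor, the namespace value, and the CLI shape are illustrative assumptions loosely modeled on the argparse fragment quoted in the next comment, not the PR's actual script.

import argparse

import ray


@ray.remote
class Sleeper:
    # Does nothing; its only purpose is to hold the requested CPUs so the
    # autoscaler has to keep a worker around for it.
    def ready(self):
        return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("name")
    parser.add_argument("--num-cpus", type=float, default=1)
    args = parser.parse_args()

    ray.init(namespace="default_namespace", address="auto")
    # One detached actor per desired worker: creating several of these drives
    # scale-up toward maxReplicas, and killing them (or lowering maxReplicas)
    # lets the autoscaler scale the group back down.
    actor = Sleeper.options(
        name=args.name, lifetime="detached", num_cpus=args.num_cpus
    ).remote()
    ray.get(actor.ready.remote())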
parser = argparse.ArgumentParser()
parser.add_argument("--num-cpus", type=float, default=1)
parser.add_argument("--num-gpus", type=float, default=0)
My philosophy is to only add something when it is actually used. For example, --num-gpus is not used here, so I will remove it.
Signed-off-by: machichima <[email protected]>
No problem!
Signed-off-by: machichima <[email protected]>
Why are these changes needed?

Add an E2E test to ray-operator/test/e2eautoscaler/raycluster_autoscaler_test.go for testing that the autoscaler can scale the nodes in a worker group up and down to maxReplicas when users update it during the test.

Related issue number
Closes #3616
Checks