-
Notifications
You must be signed in to change notification settings - Fork 538
[Feature] [kubectl-plugin] Expose setting shutdownAfterJobFinishes
and ttlSecondsAfterFinished
in ray job submit
#3627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
shutdownAfterJobFinishes bool | ||
ttlSecondsAfterFinished int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add shutdownAfterJobFinishes
and ttlSecondsAfterFinished
if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 { | ||
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement validation logic when --filename
is provided
if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.
} else if options.fileName == "" { | ||
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 { | ||
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement validation logic when --filename
is not provided
if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.
@@ -30,7 +30,7 @@ func TestRayJobSubmitComplete(t *testing.T) { | |||
assert.Equal(t, "fake/path/to/env/yaml", fakeSubmitJobOptions.runtimeEnv) | |||
} | |||
|
|||
func TestRayJobSubmitValidate(t *testing.T) { | |||
func TestRayJobSubmitWithYamlValidate(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split TestRayJobSubmitValidate
into two test cases: one for YAML-based submission and one for non-YAML-based submission.
@@ -99,6 +122,52 @@ spec: | |||
} | |||
} | |||
|
|||
func TestRayJobSubmitWithoutYamlValidate(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Split TestRayJobSubmitValidate
into two test cases: one for YAML-based submission and one for non-YAML-based submission.
if !rayJob.Spec.ShutdownAfterJobFinishes && rayJob.Spec.TTLSecondsAfterFinished != 0 { | ||
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement validation logic in ray operator
if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.
Hi @MortalHappiness and @kevin85421 PTAL |
…idation and plugin
cc @davidxia would you mind reviewing this PR? Thanks! |
@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error { | |||
} | |||
options.runtimeEnvJson = string(runtimeJson) | |||
} | |||
|
|||
if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 { | |||
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better to use the YAML field names ttlSecondsAfterFinished
and shutdownAfterJobFinishes
here
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") | |
return fmt.Errorf("ttlSecondsAfterFinished is only supported when shutdownAfterJobFinishes is 'true'") |
@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error { | |||
} | |||
options.runtimeEnvJson = string(runtimeJson) | |||
} | |||
|
|||
if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to this PR, but when I was testing this change, I ran kubectl ray job submit -f ray-job.sample.yaml --working-dir . --shutdown-after-job-finishes --ttl-seconds-after-finished 600 -- python /home/ray/samples/sample_code.py
and wondered why the new flags weren't used. I had to look at this source to understand -f/--filename
makes the command ignore many of these other flags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I think we should take the filename as a base, and flags should override the values in the file.
I created an issue to track this #3639
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior of kubectl ray create cluster
is to error with an actionable error message if declarative file and CLI flags are used together. We can consider doing that here too if we think it's not worth it to have override logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is worth overriding the values in the file. I will change this.
} else if strings.TrimSpace(options.rayjobName) == "" { | ||
return fmt.Errorf("Must set either yaml file (--filename) or set Ray job name (--name)") | ||
} else if options.fileName == "" { | ||
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 { | ||
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true") | |
return fmt.Errorf("--ttl-seconds-after-finished is only supported when --shutdown-after-job-finishes is set to true") |
@@ -144,6 +144,10 @@ func ValidateRayJobSpec(rayJob *rayv1.RayJob) error { | |||
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false is not allowed to be suspended") | |||
} | |||
|
|||
if !rayJob.Spec.ShutdownAfterJobFinishes && rayJob.Spec.TTLSecondsAfterFinished != 0 { | |||
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero") | |
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished") |
} else if strings.TrimSpace(options.rayjobName) == "" { | ||
return fmt.Errorf("Must set either yaml file (--filename) or set Ray job name (--name)") | ||
} else if options.fileName == "" { | ||
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also check ttlSecondsAfterFinished ≥ 0 in both cases? Right now this command will create a RayJob with ttlSecondsAfterFinished: -10
kubectl ray --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n hyperkube job submit --name dxia-test --shutdown-after-job-finishes --ttl-seconds-after-finished -10 --working-dir . --dry-run -- python /home/ray/samples/sample_code.py | grep -B 1 ttlSeconds
submissionMode: InteractiveMode
ttlSecondsAfterFinished: -10
TTLSecondsAfterFinished: secondsValue, | ||
RayClusterSpec: &ClusterSpecTest.Spec, | ||
RuntimeEnvYAML: "mytest yaml", | ||
ShutdownAfterJobFinishes: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why change apiserver tests?
@@ -171,6 +173,8 @@ func NewJobSubmitCommand(cmdFactory cmdutil.Factory, streams genericclioptions.I | |||
cmd.Flags().StringVar(&options.workerGPU, "worker-gpu", "0", "number of GPUs in each worker group replica") | |||
cmd.Flags().BoolVar(&options.dryRun, "dry-run", false, "print the generated YAML instead of creating the cluster. Only works when filename is not provided") | |||
cmd.Flags().BoolVarP(&options.verbose, "verbose", "v", false, "Passing the '--verbose' flag to the 'ray job submit' command") | |||
cmd.Flags().BoolVar(&options.shutdownAfterJobFinishes, "shutdown-after-job-finishes", false, "Shutdown the cluster after the job finishes") | |||
cmd.Flags().Int32Var(&options.ttlSecondsAfterFinished, "ttl-seconds-after-finished", 0, "TTL seconds after finished. Only works when filename is not provided") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It still works when filename is provided right?
@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error { | |||
} | |||
options.runtimeEnvJson = string(runtimeJson) | |||
} | |||
|
|||
if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I think we should take the filename as a base, and flags should override the values in the file.
I created an issue to track this #3639
Why are these changes needed?
Currently shutdownAfterJobFinishes and ttlSecondsAfterFinished aren't being exposed in kubectl ray job submit, although they are available in the CRD.
Ray Operator
Add support for
shutdownAfterJobFinishes
andttlSecondsAfterFinished
in RayJob spec.Implement validation logic: if
shutdownAfterJobFinishes
is False, thenttlSecondsAfterFinished
must be 0 or unset.Add corresponding unit tests.
kubectl-plugin
Add
shutdownAfterJobFinishes
andttlSecondsAfterFinished
flags to the rayjob submit command.Implement validation logic: if
shutdownAfterJobFinishes
is False, thenttlSecondsAfterFinished
must be 0 or unset.Add corresponding unit tests.
Split TestRayJobSubmitValidate into two test cases: one for YAML-based submission and one for non-YAML-based submission.
Manual tests
Success case (
--shutdown-after-job-finishes
is given and--ttl-seconds-after-finished
set to 10)$ cd kubectl-plugin/test/e2e/ $ kubectl ray job submit --name rayjob-sample --runtime-env testdata/rayjob-submit-working-dir/runtime-env-sample.yaml --head-cpu 1 --head-memory 2Gi --worker-cpu 1 --worker-memory 2Gi --shutdown-after-job-finishes --ttl-seconds-after-finished 10 -- python entrypoint-python-sample.py
Failure case (
--shutdown-after-job-finishes
is not given)$ cd kubectl-plugin/test/e2e/ $ kubectl ray job submit --name rayjob-sample --runtime-env testdata/rayjob-submit-working-dir/runtime-env-sample.yaml --head-cpu 1 --head-memory 2Gi --worker-cpu 1 --worker-memory 2Gi --ttl-seconds-after-finished 10 -- python entrypoint-python-sample.py
Output
Error: ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true
Related issue number
Closes #3560
Checks