Skip to content

[Feature] [kubectl-plugin] Expose setting shutdownAfterJobFinishes and ttlSecondsAfterFinished in ray job submit #3627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

CheyuWu
Copy link
Contributor

@CheyuWu CheyuWu commented May 17, 2025

Why are these changes needed?

Currently shutdownAfterJobFinishes and ttlSecondsAfterFinished aren't being exposed in kubectl ray job submit, although they are available in the CRD.

Ray Operator

  • Add support for shutdownAfterJobFinishes and ttlSecondsAfterFinished in RayJob spec.

  • Implement validation logic: if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.

  • Add corresponding unit tests.

kubectl-plugin

  • Add shutdownAfterJobFinishes and ttlSecondsAfterFinished flags to the rayjob submit command.

  • Implement validation logic: if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.

  • Add corresponding unit tests.

  • Split TestRayJobSubmitValidate into two test cases: one for YAML-based submission and one for non-YAML-based submission.

Manual tests

Success case (--shutdown-after-job-finishes is given and --ttl-seconds-after-finished set to 10)

$ cd kubectl-plugin/test/e2e/
$ kubectl ray job submit --name rayjob-sample --runtime-env testdata/rayjob-submit-working-dir/runtime-env-sample.yaml --head-cpu 1 --head-memory 2Gi --worker-cpu 1 --worker-memory 2Gi --shutdown-after-job-finishes --ttl-seconds-after-finished 10 -- python entrypoint-python-sample.py

image

Failure case (--shutdown-after-job-finishes is not given)

$ cd kubectl-plugin/test/e2e/
$ kubectl ray job submit --name rayjob-sample --runtime-env testdata/rayjob-submit-working-dir/runtime-env-sample.yaml --head-cpu 1 --head-memory 2Gi --worker-cpu 1 --worker-memory 2Gi  --ttl-seconds-after-finished 10 -- python entrypoint-python-sample.py

Output

Error: ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true

Related issue number

Closes #3560

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Comment on lines +77 to +78
shutdownAfterJobFinishes bool
ttlSecondsAfterFinished int32
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add shutdownAfterJobFinishes and ttlSecondsAfterFinished

Comment on lines +262 to +265
if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 {
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement validation logic when --filename is provided

if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.

Comment on lines +268 to +271
} else if options.fileName == "" {
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 {
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement validation logic when --filename is not provided

if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.

@@ -30,7 +30,7 @@ func TestRayJobSubmitComplete(t *testing.T) {
assert.Equal(t, "fake/path/to/env/yaml", fakeSubmitJobOptions.runtimeEnv)
}

func TestRayJobSubmitValidate(t *testing.T) {
func TestRayJobSubmitWithYamlValidate(t *testing.T) {
Copy link
Contributor Author

@CheyuWu CheyuWu May 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split TestRayJobSubmitValidate into two test cases: one for YAML-based submission and one for non-YAML-based submission.

@@ -99,6 +122,52 @@ spec:
}
}

func TestRayJobSubmitWithoutYamlValidate(t *testing.T) {
Copy link
Contributor Author

@CheyuWu CheyuWu May 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split TestRayJobSubmitValidate into two test cases: one for YAML-based submission and one for non-YAML-based submission.

Comment on lines +147 to +150
if !rayJob.Spec.ShutdownAfterJobFinishes && rayJob.Spec.TTLSecondsAfterFinished != 0 {
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero")
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement validation logic in ray operator

if shutdownAfterJobFinishes is False, then ttlSecondsAfterFinished must be 0 or unset.

@CheyuWu
Copy link
Contributor Author

CheyuWu commented May 17, 2025

Hi @MortalHappiness and @kevin85421 PTAL

@CheyuWu CheyuWu force-pushed the feat/jobfinished branch from 639e1d1 to 88c16de Compare May 18, 2025 18:47
@CheyuWu CheyuWu force-pushed the feat/jobfinished branch from 28a6566 to 3d35117 Compare May 19, 2025 16:55
@kevin85421
Copy link
Member

cc @davidxia would you mind reviewing this PR? Thanks!

@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error {
}
options.runtimeEnvJson = string(runtimeJson)
}

if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 {
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better to use the YAML field names ttlSecondsAfterFinished and shutdownAfterJobFinishes here

Suggested change
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
return fmt.Errorf("ttlSecondsAfterFinished is only supported when shutdownAfterJobFinishes is 'true'")

@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error {
}
options.runtimeEnvJson = string(runtimeJson)
}

if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to this PR, but when I was testing this change, I ran kubectl ray job submit -f ray-job.sample.yaml --working-dir . --shutdown-after-job-finishes --ttl-seconds-after-finished 600 -- python /home/ray/samples/sample_code.py and wondered why the new flags weren't used. I had to look at this source to understand -f/--filename makes the command ignore many of these other flags.

Copy link
Member

@MortalHappiness MortalHappiness May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I think we should take the filename as a base, and flags should override the values in the file.

I created an issue to track this #3639

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior of kubectl ray create cluster is to error with an actionable error message if declarative file and CLI flags are used together. We can consider doing that here too if we think it's not worth it to have override logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is worth overriding the values in the file. I will change this.

} else if strings.TrimSpace(options.rayjobName) == "" {
return fmt.Errorf("Must set either yaml file (--filename) or set Ray job name (--name)")
} else if options.fileName == "" {
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 {
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("ttl-seconds-after-finished is only supported when shutdown-after-job-finishes is set to true")
return fmt.Errorf("--ttl-seconds-after-finished is only supported when --shutdown-after-job-finishes is set to true")

@@ -144,6 +144,10 @@ func ValidateRayJobSpec(rayJob *rayv1.RayJob) error {
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false is not allowed to be suspended")
}

if !rayJob.Spec.ShutdownAfterJobFinishes && rayJob.Spec.TTLSecondsAfterFinished != 0 {
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished is non-zero")
return fmt.Errorf("a RayJob with shutdownAfterJobFinishes set to false cannot have TTLSecondsAfterFinished")

} else if strings.TrimSpace(options.rayjobName) == "" {
return fmt.Errorf("Must set either yaml file (--filename) or set Ray job name (--name)")
} else if options.fileName == "" {
if !options.shutdownAfterJobFinishes && options.ttlSecondsAfterFinished != 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also check ttlSecondsAfterFinished ≥ 0 in both cases? Right now this command will create a RayJob with ttlSecondsAfterFinished: -10

kubectl ray --context gke_kubeflow-platform_europe-west4-b_ml-compute-1 -n hyperkube job submit --name dxia-test --shutdown-after-job-finishes --ttl-seconds-after-finished -10 --working-dir . --dry-run  -- python /home/ray/samples/sample_code.py | grep -B 1 ttlSeconds
  submissionMode: InteractiveMode
  ttlSecondsAfterFinished: -10

TTLSecondsAfterFinished: secondsValue,
RayClusterSpec: &ClusterSpecTest.Spec,
RuntimeEnvYAML: "mytest yaml",
ShutdownAfterJobFinishes: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change apiserver tests?

@@ -171,6 +173,8 @@ func NewJobSubmitCommand(cmdFactory cmdutil.Factory, streams genericclioptions.I
cmd.Flags().StringVar(&options.workerGPU, "worker-gpu", "0", "number of GPUs in each worker group replica")
cmd.Flags().BoolVar(&options.dryRun, "dry-run", false, "print the generated YAML instead of creating the cluster. Only works when filename is not provided")
cmd.Flags().BoolVarP(&options.verbose, "verbose", "v", false, "Passing the '--verbose' flag to the 'ray job submit' command")
cmd.Flags().BoolVar(&options.shutdownAfterJobFinishes, "shutdown-after-job-finishes", false, "Shutdown the cluster after the job finishes")
cmd.Flags().Int32Var(&options.ttlSecondsAfterFinished, "ttl-seconds-after-finished", 0, "TTL seconds after finished. Only works when filename is not provided")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still works when filename is provided right?

@@ -254,8 +258,17 @@ func (options *SubmitJobOptions) Validate() error {
}
options.runtimeEnvJson = string(runtimeJson)
}

if !options.RayJob.Spec.ShutdownAfterJobFinishes && options.RayJob.Spec.TTLSecondsAfterFinished != 0 {
Copy link
Member

@MortalHappiness MortalHappiness May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I think we should take the filename as a base, and flags should override the values in the file.

I created an issue to track this #3639

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants