A KEP for StartupPolicy #244

kannon92 · 2023-08-03T13:53:36Z

I wanted to draft a potential API for StartupPolicy as we have discussed it.

keps/104-StartupPolicy/README.md

kannon92 · 2023-08-08T13:25:17Z

/retest

kannon92 · 2023-08-08T19:45:47Z

/cc @vsoch @Gekko0114

Would this API fit your needs?

k8s-ci-robot · 2023-08-08T19:45:49Z

@kannon92: GitHub didn't allow me to request PR reviews from the following users: vsoch, Gekko0114.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @vsoch @Gekko0114

Would this API fit your needs?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

vsoch · 2023-08-08T20:43:56Z

This would definitely be useful for us - what I've had to do is typically start the workers and handle race conditions with sleeps / waiting, that sort of thing.

kannon92 · 2023-08-09T13:48:28Z

This would definitely be useful for us - what I've had to do is typically start the workers and handle race conditions with sleeps / waiting, that sort of thing.

And an API like:

 startupPolicy:
  - name: driver
    status: Ready

Would be sufficent?

Gekko0114 · 2023-08-09T14:00:45Z

Thanks @kannon92 so much!

IMO, I prefer defining startUpPolicy under each replicatedJob field.
I think this implementation is easier to read.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-ready-worker-start
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0 
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: driver
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10000
  - name: workers
    replicas: 1
    startupPolicy:
    - name: driver
       status: Ready
    template:
      spec:
        backoffLimit: 0 
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10

FYI
I prefer Tekton's design, especially runAfter.
https://tekton.dev/docs/pipelines/pipelines/

apiVersion: tekton.dev/v1 # or tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipeline
spec:
  workspaces:
    - name: source
  tasks:
    - name: gen-code
      taskRef:
        name: gen-code # gen-code expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
    - name: commit
      taskRef:
        name: commit # commit expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
      runAfter:
        - gen-code

vsoch · 2023-08-09T16:49:11Z

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

I think he's mirroring SuccessPolicy-> TargetReplicatedJobs - the idea being that if there is more definition / attributes for such a policy, we have a place to put them and the ReplicatedJob itself doesn't become a single long list of loosely associated features.

jobset/api/jobset/v1alpha2/jobset_types.go

Line 177 in 61ad4b0

TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

I think my preference is for that modular (and consistent, since we already have SuccessPolicy and FailurePolicy) approach, ensuring that each of those sections can be produced independently and then added to a replicatedJob.

And if we wanted to be more consistent it might look like:

type StartupPolicy struct {
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
	Status JobReadyStatus `json:"status"`
}

This makes the assumption that a JobSet will have one condition (set of jobs and a status) to indicate ready. I'm not sure we want to allow more complexity than that (the workflow complexity thing is a non goal).

kannon92 · 2023-08-10T13:53:48Z

Thanks @kannon92 so much!

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-ready-worker-start
spec:
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0 
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: driver
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10000
  - name: workers
    replicas: 1
    startupPolicy:
    - name: driver
       status: Ready
    template:
      spec:
        backoffLimit: 0 
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 10

@vsoch is correct. I was aiming to replicate a similar behavior that we have for SuccessPolicy and RestartPolicy. Those APIs are defined outside of the replicated job spec. I think its cleaner to do that as its all in one place. Hiding them in the ReplicatedJob spec seems a bit confusing.

FYI I prefer Tekton's design, especially runAfter. https://tekton.dev/docs/pipelines/pipelines/

apiVersion: tekton.dev/v1 # or tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: pipeline
spec:
  workspaces:
    - name: source
  tasks:
    - name: gen-code
      taskRef:
        name: gen-code # gen-code expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
    - name: commit
      taskRef:
        name: commit # commit expects a Workspace named "source"
      workspaces:
        - name: source # <- mapping workspace name
      runAfter:
        - gen-code

Yea, this is called out as a non-goal as I don't think this API will ever be complete in the context of workflows.

kannon92 · 2023-08-10T13:59:02Z

IMO, I prefer defining startUpPolicy under each replicatedJob field. I think this implementation is easier to read.

I think he's mirroring SuccessPolicy-> TargetReplicatedJobs - the idea being that if there is more definition / attributes for such a policy, we have a place to put them and the ReplicatedJob itself doesn't become a single long list of loosely associated features.

Correct.

jobset/api/jobset/v1alpha2/jobset_types.go

Line 177 in 61ad4b0

TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`

I think my preference is for that modular (and consistent, since we already have SuccessPolicy and FailurePolicy) approach, ensuring that each of those sections can be produced independently and then added to a replicatedJob.

And if we wanted to be more consistent it might look like:
type StartupPolicy struct {
	TargetReplicatedJobs []string `json:"targetReplicatedJobs,omitempty"`
	Status JobReadyStatus `json:"status"`
}
This makes the assumption that a JobSet will have one condition (set of jobs and a status) to indicate ready. I'm not sure we want to allow more complexity than that (the workflow complexity thing is a non goal).

Yea I believe there was an interest in both ready and succeeded. Init containers run after one another and the thought was that we could have a similar idea for this.

I ended up going with the API I have because of the flexibility of assuming ready or succeeded for the replicated jobs.

I do acknowledge the irony of calling out that we don't want a workflow engine but then we at least have a poor person workflow engine with this.

Gekko0114 · 2023-08-10T15:24:07Z

I commented without understanding the context of jobset, thank you for your explanations

keps/104-StartupPolicy/README.md

danielvegamyhre · 2023-08-15T17:25:56Z

keps/104-StartupPolicy/README.md

+When a JobSet is started with `StartupPolicy` specified, we will create jobs in a suspended state (ie `Job.Spec.Suspend = true`)  This avoids starting the underlying jobs.
+
+If StartupPolicy is set, then we will suspend the other jobs until the StartupPolicy is considered succesful.
+We will unsuspend only the jobs that match `TargetReplicatedJobStartup`.  Once they are considered "started" by matching on targeted status, then we resume all the rest of the jobs. 


If we unsuspend jobs matching TargetReplicatedJobStartup, then in the example where we want sequential execution of Job A -> Job B -> Job -> C, wouldn't they all run in parallel since there is a separate startup policy targeting each of them?

Or is this saying that we handle only 1 startup policy at a time, in the order in which they are specified in the spec (e.g. we first see a startup policy for Job A so we unsuspend it, wait for Job A to complete, then we check the next startup policy and it's for Job B, so we unsuspend that, etc.)?

Or is this saying that we handle only 1 startup policy at a time, in the order in which they are specified in the spec (e.g. we first see a startup policy for Job A so we unsuspend it, wait for Job A to complete, then we check the next startup policy and it's for Job B, so we unsuspend that, etc.)?

This is where I find it difficult balancing design versus implementation details.

In the implementation, I apply StartupPolicy in sequential order.

I would verify if JobA is suspended and then unsuspend. I would return and trigger a reconcile.

I loop over the startupPolicy rules and check if JobA is successful via the rules (Ready/Succeeded) and if it is, I move on to JobB and resume if its suspended.

And repeat.

If all the startup polices items are successful then we unsuspend everything not in the startup list (because the startup ones are running or succeeded).

updated with this info. PTAL

keps/104-StartupPolicy/README.md

kannon92 · 2023-12-18T22:16:15Z

/retest

keps/104-StartupPolicy/README.md

ahg-g · 2023-12-21T16:21:57Z

@kannon92 this is great, can you please resolve the remaining comments and get this merged?

keps/104-StartupPolicy/README.md

ahg-g · 2023-12-21T19:20:24Z

A couple final nits, otherwise looks good to me.

ahg-g · 2023-12-21T19:44:14Z

/lgtm
/approve

k8s-ci-robot · 2023-12-21T19:44:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

kannon92 · 2023-12-21T20:03:50Z

/retest

danielvegamyhre · 2023-12-21T22:21:05Z

Thanks for driving this @kannon92! 🥳

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 3, 2023

k8s-ci-robot requested review from ahg-g and danielvegamyhre August 3, 2023 13:53

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 3, 2023

danielvegamyhre reviewed Aug 3, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Show resolved Hide resolved

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2023

This was referenced Aug 4, 2023

☂️ Requirements for v0.3.0 release #239

Closed

Startup policy implementation #246

Merged

add suspend to replicated job status #250

Merged

kannon92 changed the title ~~WIP: A KEP for StartupPolicy~~ A KEP for StartupPolicy Aug 8, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 8, 2023

kannon92 force-pushed the startup-policy-kpe branch from d5b2a6e to 55ecf7d Compare August 8, 2023 19:18

kannon92 requested a review from danielvegamyhre August 8, 2023 19:43

kannon92 mentioned this pull request Aug 9, 2023

boiler plate code for kep templating #252

Merged

kannon92 force-pushed the startup-policy-kpe branch from 55ecf7d to 661f1bb Compare August 9, 2023 18:12

danielvegamyhre reviewed Aug 15, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

danielvegamyhre reviewed Aug 15, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

danielvegamyhre reviewed Aug 15, 2023

View reviewed changes

kannon92 force-pushed the startup-policy-kpe branch from c3a70cd to 84235b3 Compare December 12, 2023 22:50

ahg-g reviewed Dec 16, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

kannon92 force-pushed the startup-policy-kpe branch 3 times, most recently from cb80a39 to 9d855b9 Compare December 18, 2023 21:13

alculquicondor reviewed Dec 18, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

keps/104-StartupPolicy/README.md Show resolved Hide resolved

ahg-g reviewed Dec 18, 2023

View reviewed changes

kannon92 force-pushed the startup-policy-kpe branch from 9d855b9 to c03cf4d Compare December 18, 2023 21:48

ahg-g reviewed Dec 18, 2023

View reviewed changes

kannon92 force-pushed the startup-policy-kpe branch from c03cf4d to 4f8a19e Compare December 19, 2023 22:37

ahg-g reviewed Dec 21, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

kannon92 mentioned this pull request Dec 21, 2023

Add Finalizers to Jobs so ReplicatedJobStatus can be updated before jobs are removed. #360

Open

add kep for startuppolicy

bbd6573

kannon92 force-pushed the startup-policy-kpe branch from 4f8a19e to bbd6573 Compare December 21, 2023 16:44

kannon92 requested a review from ahg-g December 21, 2023 16:45

ahg-g reviewed Dec 21, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

ahg-g reviewed Dec 21, 2023

View reviewed changes

keps/104-StartupPolicy/README.md Outdated Show resolved Hide resolved

update kep feedback

9ed5bd4

k8s-ci-robot assigned ahg-g Dec 21, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 21, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 21, 2023

k8s-ci-robot merged commit dbabc65 into kubernetes-sigs:main Dec 21, 2023
10 checks passed

danielvegamyhre mentioned this pull request Jan 22, 2024

Declare dependency relation between the jobs in a jobset #296

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A KEP for StartupPolicy #244

A KEP for StartupPolicy #244

kannon92 commented Aug 3, 2023

kannon92 commented Aug 8, 2023

kannon92 commented Aug 8, 2023

k8s-ci-robot commented Aug 8, 2023

vsoch commented Aug 8, 2023

kannon92 commented Aug 9, 2023

Gekko0114 commented Aug 9, 2023

vsoch commented Aug 9, 2023 •

edited

Loading

kannon92 commented Aug 10, 2023

kannon92 commented Aug 10, 2023

Gekko0114 commented Aug 10, 2023

danielvegamyhre Aug 15, 2023

kannon92 Aug 16, 2023

kannon92 Aug 16, 2023

kannon92 commented Dec 18, 2023

ahg-g commented Dec 21, 2023

ahg-g commented Dec 21, 2023

ahg-g commented Dec 21, 2023

k8s-ci-robot commented Dec 21, 2023

kannon92 commented Dec 21, 2023

danielvegamyhre commented Dec 21, 2023

A KEP for StartupPolicy #244

A KEP for StartupPolicy #244

Conversation

kannon92 commented Aug 3, 2023

kannon92 commented Aug 8, 2023

kannon92 commented Aug 8, 2023

k8s-ci-robot commented Aug 8, 2023

vsoch commented Aug 8, 2023

kannon92 commented Aug 9, 2023

Gekko0114 commented Aug 9, 2023

vsoch commented Aug 9, 2023 • edited Loading

kannon92 commented Aug 10, 2023

kannon92 commented Aug 10, 2023

Gekko0114 commented Aug 10, 2023

danielvegamyhre Aug 15, 2023

Choose a reason for hiding this comment

kannon92 Aug 16, 2023

Choose a reason for hiding this comment

kannon92 Aug 16, 2023

Choose a reason for hiding this comment

kannon92 commented Dec 18, 2023

ahg-g commented Dec 21, 2023

ahg-g commented Dec 21, 2023

ahg-g commented Dec 21, 2023

k8s-ci-robot commented Dec 21, 2023

kannon92 commented Dec 21, 2023

danielvegamyhre commented Dec 21, 2023

vsoch commented Aug 9, 2023 •

edited

Loading